arXiv:1711.03229v1 [cs.AR] 9 Nov 2017
A Dwarf-based Scalable Big Data Benchmarking Methodology
Wanling Gao1,2, Lei Wang1,2, Jianfeng Zhan∗1,2, Chunjie Luo1,2, Daoyi Zheng1,2, Zhen Jia4, Biwei Xie1,2, Chen Zheng1,2, Qiang Yang5,6, and Haibin Wang3
1State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences)
{gaowanling, wanglei 2011, zhanjianfeng, luochunjie, zhengdaoyi, xiebiwei, zhengchen}@ict.ac.cn
2University of Chinese Academy of Sciences
3Huawei, [email protected]
4Princeton University, [email protected]
5Beijing Academy of Frontier Science & Technology, [email protected]
6Industry, Intelligence and Information Data Technology Corporation
ABSTRACT
Different from the traditional benchmarking methodology that creates a new benchmark or proxy for every possible workload, this paper presents a scalable big data benchmarking methodology. Among a wide variety of big data analytics workloads, we identify eight big data dwarfs, each of which captures the common requirements of a class of units of computation while being reasonably divorced from individual implementations. We implement the eight dwarfs on different software stacks, e.g., OpenMP, MPI, and Hadoop, as the dwarf components. For the purpose of architecture simulation, we construct and tune big data proxy benchmarks using directed acyclic graph (DAG)-like combinations of the dwarf components with different weights to mimic the benchmarks in BigDataBench. Our proxy benchmarks preserve the micro-architecture, memory, and I/O characteristics, and they shorten the simulation time by a factor of several hundred while maintaining the average micro-architectural data accuracy above 90% on both X86_64 and ARMv8 processors. We will open-source the big data dwarf components and proxy benchmarks soon.
1. INTRODUCTION
Benchmarking is the foundation of system and architecture evaluation. A benchmark is run on several different systems, and the performance and price of each system is measured and recorded [1]. The simplest benchmarking methodology is to create a new benchmark for every possible workload. PARSEC [2], BigDataBench [3] and CloudSuite [4] are put together in this way. This methodology has two drawbacks. First, modern real-world workloads change very fre-
∗The corresponding author is Jianfeng Zhan.
[Figure 1 summarizes the methodology: big data workloads are abstracted into big data dwarfs (frequently-appearing units of computation), which are implemented on specific runtime systems as the dwarf components; DAG-like combinations of these components with different (tuned) weights form big data proxy benchmarks that mimic the original workloads' micro-architectural behaviors (pipeline efficiency, instruction mix, branch prediction, cache behaviors) and system behaviors (memory access, disk I/O bandwidth, speedup).]
Figure 1: Dwarf-based Scalable Big Data Benchmarking Methodology.
quently, and we need to keep expanding the benchmark set with emerging workloads, so it is very difficult to build a comprehensive benchmark suite. Second, whether early in the design process or later in system evaluation, it is time-consuming to run a comprehensive benchmark suite. The complex software stacks of modern workloads aggravate this issue. For example, running big data benchmarks on simulators requires prohibitively long execution times. We use benchmarking scalability 1—which can be measured in terms of the cost
1Here, the meaning of scalable differs from scaleable. As oneof four properties of domain-specific benchmarks defined by
of building and running a comprehensive benchmark suite—to denote this challenge.
For scalability, proxy benchmarks, abstracted from real workloads, are widely adopted. Kernel benchmarks—small programs extracted from large application programs [5, 6]—are widely used in high performance computing. However, kernels are insufficient to completely reflect workload behaviors and have limited usefulness in making overall comparisons [7, 5], especially for complex big data workloads. Another approach is to generate synthetic traces or benchmarks from workload traces to mimic workload behaviors, using methods like statistical simulation [8, 9, 10, 11] or sampling [12, 13, 14] for acceleration. This approach omits the computation logic of the original workloads and has several disadvantages. First, trace-based synthetic benchmarks target specific architectures, so different architecture configurations generate different synthetic benchmarks. Second, synthetic benchmarks cannot be deployed across different architectures, which makes cross-architecture comparisons hard. Third, workload traces depend strongly on the data input, which makes it hard to mimic real benchmarks with diverse data inputs.
Overly complex workloads are not helpful for either reproducibility or interpretability of performance data. So identifying abstractions of frequently-appearing units of computation, which we call big data dwarfs, is an important step toward fully understanding big data analytics workloads. The concept of a dwarf, first proposed by Phil Colella [15], is thought of as not only a high-level abstraction of workload patterns but also a minimum set of necessary functionality [16]. A dwarf captures the common requirements of a class of units of computation while being reasonably divorced from individual implementations [17] 2. For example, transformation from one domain to another contains a class of units of computation, such as the fast Fourier transform (FFT) and the discrete cosine transform (DCT). Intuitively, we can build big data proxy benchmarks on the basis of big data dwarfs. Considering the complexity, diversity and fast-changing characteristics of big data analytics workloads, we propose a dwarf-based scalable big data benchmarking methodology, as shown in Fig. 1.
After thoroughly analyzing a majority of workloads in five typical big data application domains (search engine, social network, e-commerce, multimedia and bioinformatics), we identify eight big data dwarfs: matrix, sampling, logic, transform, set, graph, sort and basic statistic computations, which frequently appear 3 and whose combinations describe most of the big data workloads we investigated. We implement the eight dwarfs on different software stacks like MPI and Hadoop, with diverse data generation tools for text, matrix and graph
Jim Gray [1], the latter refers to scaling the benchmark up to larger systems.
2Our definition has a subtle difference from the definition in the Berkeley report [17].
3We acknowledge that our eight dwarfs may not be sufficient for applications we failed to investigate. We will keep expanding this dwarf set.
data, which we call the dwarf components. As the software stack has great influence on workload behaviors [3, 18], our dwarf component implementation considers the execution model of software stacks and the programming styles of workloads. For feasible simulation, we choose the OpenMP version of the dwarf components to construct proxy benchmarks, as OpenMP is much more light-weight than other programming frameworks in terms of binary size and is also supported by a majority of simulators [19], such as GEM5 [20]. With a node representing an original or intermediate data set being processed and an edge representing a dwarf component, we use a DAG-like combination of one or more dwarf components with different weights to form our proxy benchmarks. We provide an auto-tuning tool to generate qualified proxy benchmarks satisfying the simulation time and micro-architectural data accuracy requirements, through an iterative execution comprising adjusting and feedback stages.
On the typical X86_64 and ARMv8 processors, the
proxy benchmarks achieve runtime speedups of several hundred times while keeping the average micro-architectural data accuracy above 90% with respect to the benchmarks from BigDataBench. Our proxy benchmarks have been applied to ARM processor design and implementation in our industry partnership. Our contributions are two-fold:
• We propose a dwarf-based scalable big data bench-marking methodology, using a DAG-like combina-tion of the dwarf components with different weightsto mimic big data analytics workloads.
• We identify eight big data dwarfs, and implement the dwarf components for each dwarf on different software stacks with diverse data inputs. Finally, we construct the big data proxy benchmarks, which reduce simulation time and guarantee performance data accuracy for micro-architectural simulation.
The rest of the paper is organized as follows. Section 2 presents the methodology. Section 3 performs evaluations on a five-node X86_64 cluster. In Section 4, we report using the proxy benchmarks on the ARMv8 processor. Section 5 introduces the related work. Finally, we draw a conclusion in Section 6.
2. BENCHMARKING METHODOLOGY
2.1 Methodology Overview
In this section, we illustrate our dwarf-based scalable big data benchmarking methodology, including dwarf identification, dwarf component implementation, and proxy benchmark construction. Fig. 2 presents our benchmarking methodology.
First, we analyze a majority of big data analytics workloads through workload decomposition and distill the frequently-appearing classes of units of computation as big data dwarfs. Second, we provide the dwarf implementations on different software stacks, as the
[Figure 2 shows the two phases: (1) a majority of big data analytics workloads, extracted from five application domains, are analyzed and decomposed into frequently-appearing classes of units of computation, which are implemented as the dwarf components; (2) given a big data analytics workload, tracing and profiling yield initial weights, from which a proxy benchmark is combined and auto-tuned against the chosen metrics into a qualified proxy benchmark.]
Figure 2: Methodology Overview.
dwarf components. Finally, we construct proxy benchmarks using DAG-like combinations of the dwarf components with different weights to mimic real-world workloads. A DAG-like structure uses a node to represent an original or intermediate data set being processed, and an edge to represent a dwarf component.
Given a big data analytics workload, we obtain its running trace and execution time through tracing and profiling tools. According to the execution ratios, we identify the hotspot functions and correlate them to code fragments of the workload through bottom-up analysis. We then analyze these code fragments to choose the specific dwarf components and set their initial weights according to the execution ratios. Based on the dwarf components and initial weights, we construct our proxy benchmark as a DAG-like combination of dwarf components. For the targeted performance data accuracy, such as cache behaviors or I/O behaviors, we provide an auto-tuning tool to tune the parameters of the proxy benchmark and generate a qualified proxy benchmark that satisfies the requirements.
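The DAG-like combination described above can be sketched in a few lines of Python. This is only an illustration of the idea (the class names, the weighting scheme, and the toy dwarf functions are our own, not the authors' OpenMP implementation); nodes hold data sets and each edge applies one dwarf component.

```python
# Illustrative sketch: a proxy benchmark as a weighted DAG of dwarf components.
# Nodes represent original/intermediate data sets; edges are dwarf components.

class DwarfComponent:
    def __init__(self, name, func, weight):
        self.name = name      # e.g. "quick_sort" (hypothetical label)
        self.func = func      # the unit of computation
        self.weight = weight  # initial weight from the hotspot execution ratio

    def run(self, data):
        return self.func(data)

class ProxyBenchmark:
    def __init__(self):
        self.edges = []  # (src_node, dst_node, component)

    def add_edge(self, src, dst, component):
        self.edges.append((src, dst, component))

    def run(self, inputs):
        # Edges are assumed to be added in topological order for this sketch.
        data = dict(inputs)
        for src, dst, comp in self.edges:
            data[dst] = comp.run(data[src])
        return data

# Toy TeraSort-like proxy: sample the input, then sort the sample.
pb = ProxyBenchmark()
pb.add_edge("input", "sampled",
            DwarfComponent("interval_sampling", lambda d: d[::10], 0.10))
pb.add_edge("sampled", "sorted",
            DwarfComponent("quick_sort", sorted, 0.70))
result = pb.run({"input": list(range(20))})
```

The weights are carried as metadata here; in the methodology they steer how much work each component performs and are adjusted by the auto-tuning tool.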
2.2 Big Data Dwarf and Dwarf Components
In this section, we illustrate the big data dwarfs and the corresponding dwarf component implementations.
After singling out a broad spectrum of big data analytics workloads (machine learning, data mining, computer vision and natural language processing) through investigating five application domains (search engine, social network, e-commerce, multimedia, and bioinformatics), we analyze these workloads and decompose them into multiple classes of units of computation. According to their frequency and importance, we finalize eight big data dwarfs, which are abstractions of frequently-appearing classes of units of computation. Table 1 shows the importance of the eight classes of units of computation (dwarfs) in a majority of big data analytics workloads. These eight dwarfs are the major classes of units of computation in a wide variety of big data analytics workloads.
2.2.1 Eight Big Data Dwarfs
In this subsection, we summarize the eight big data dwarfs frequently appearing in big data analytics workloads.
Matrix Computations In big data analytics, many problems involve matrix computations, such as matrix multiplication and matrix transposition.
Sampling Computations Sampling plays an essential role in big data processing; it can obtain an approximate solution when a problem cannot be solved analytically.
Logic Computations We name computations performing bit manipulation as logic computations, such as hashing, data compression and encryption.
Transform Computations The transform computations here mean the conversion from the original domain (such as time) to another domain (such as frequency). Common transform computations include the discrete Fourier transform (DFT), discrete cosine transform (DCT) and wavelet transform.
Set Computations In mathematics, a set is a collection of distinct objects, and the concept is likewise widely used in computer science. For example, similarity analysis of two data sets involves set computations, such as Jaccard similarity. Furthermore, fuzzy sets and rough sets play very important roles in computer science.
Graph Computations Many applications involve graphs, with nodes representing entities and edges representing dependencies. Graph computations are notorious for having irregular memory access patterns.
Sort Computations Sort is widely used in many areas. Jim Gray considered sort to be the core of modern databases [17], which shows its fundamental role.
Basic Statistic Computations Basic statistic computations obtain summary information through statistical computations, such as counting and probability statistics.
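As a concrete instance of the set-computation dwarf mentioned above, Jaccard similarity reduces to set intersection and union. The sketch below is a generic textbook formulation, not the paper's dwarf component code:

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two collections: |A ∩ B| / |A ∪ B|.
    A representative set-computation dwarf (intersection, union)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # convention: two empty sets are identical
    return len(sa & sb) / len(sa | sb)

# Two documents sharing 2 of 4 distinct words have similarity 0.5.
print(jaccard_similarity(["big", "data", "dwarf"], ["big", "data", "proxy"]))
```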
2.2.2 Dwarf Components
Fig. 3 presents an overview of our dwarf components, which consist of two parts: data generation tools and dwarf implementations. The data generation tools pro-
Table 1: Eight Classes of Units of Computation.

Category | Application Domain | Workload | Unit of Computation
Graph Mining | Search Engine; Community Detection | PageRank | Matrix, Graph, Sort
 | | BFS, Connected component (CC) | Graph
Dimension Reduction | Image Processing; Text Processing | Principal components analysis (PCA) | Matrix
 | | Latent dirichlet allocation (LDA) | Basic Statistic, Sampling
Deep Learning | Image Recognition; Speech Recognition | Convolutional neural network (CNN) | Matrix, Sampling, Transform
 | | Deep belief network (DBN) | Matrix, Sampling
Recommendation | Association Rules Mining; Electronic Commerce | Apriori | Basic Statistic, Set
 | | FP-Growth | Graph, Set, Basic Statistic
 | | Collaborative filtering (CF) | Graph, Matrix
Classification | Image Recognition; Speech Recognition; Text Recognition | Support vector machine (SVM) | Matrix
 | | K-nearest neighbors (KNN) | Matrix, Sort, Basic Statistic
 | | Naive Bayes | Basic Statistic
 | | Random forest | Graph, Basic Statistic
 | | Decision tree (C4.5/CART/ID3) | Graph, Basic Statistic
Clustering | Data Mining | K-means | Matrix, Sort
Feature Preprocess | Image Processing; Signal Processing; Text Processing | Image segmentation (GrabCut) | Matrix, Graph
 | | Scale-invariant feature transform (SIFT) | Matrix, Transform, Sampling, Sort, Basic Statistic
 | | Image Transform | Matrix, Transform
 | | Term frequency-inverse document frequency (TF-IDF) | Basic Statistic
Sequence Tagging | Bioinformatics; Language Processing | Hidden Markov model (HMM) | Matrix
 | | Conditional random fields (CRF) | Matrix, Sampling
Indexing | Search Engine | Inverted index, Forward index | Basic Statistic, Logic, Set, Sort
Encoding/Decoding | Multimedia Processing; Security; Cryptography; Digital Signature | MPEG-2 | Matrix, Transform
 | | Encryption | Matrix, Logic
 | | SimHash, MinHash | Set, Logic
 | | Locality-sensitive hashing (LSH) | Set, Logic
Data Warehouse | Business intelligence | Project, Filter, OrderBy, Union | Set, Sort
vide various data inputs with different data types and distributions to the dwarf components, covering text, graph and matrix data. Since the software stack has great influence on workload behaviors [3, 18], our dwarf component implementation considers the execution model of software stacks and the programming styles of workloads using specific software stacks. Fig. 3 lists all dwarf components. For example, we provide distance calculation (i.e., Euclidean and cosine) and matrix multiplication for matrix computations. For simulation-based architecture research, we provide light-weight implementations of the dwarf components using the OpenMP [21] framework, as OpenMP has several advantages. First, it is widely used. Second, it is much more light-weight than other programming frameworks in terms of binary size. Third, it is supported by a majority of simulators [19], such as GEM5 [20]. For the purpose of system evaluation, we also implemented the dwarf components on several other software stacks including MPI [22], Hadoop [23] and Spark [24].
[Figure 3 lists the dwarf components, fed by data generation tools (covering data types and distributions): matrix computations (distance calculation, matrix multiplication), sampling computations (random sampling, interval sampling), set computations (union, intersection, difference), graph computations (graph construction, graph traversal), sort computations (quick sort, merge sort), basic statistic computations (count/average statistics, probability statistics, max/min), logic computations (MD5 hash, encryption) and transform computations (FFT/IFFT, DCT).]
Figure 3: The Overview of the Dwarf Components.
2.3 Proxy Benchmarks Construction
Fig. 4 presents the process of proxy benchmark construction, which includes a decomposing process and an auto-tuning process. Given a big data analytics workload, we obtain its hotspot functions and execution time through a multi-dimensional tracing and profiling method, includ-
[Figure 4 shows the construction flow: a big data analytics workload is decomposed into dwarf components with initial weights; after parameter initialization, the proxy benchmark enters an adjusting stage (impact analysis over input data size, weight, parallelism degree and chunk size) and a feedback stage (deviation analysis and accuracy evaluation against system and micro-architectural metrics), iterating until a qualified proxy benchmark is produced.]
Figure 4: Proxy Benchmarks Construction.
ing runtime tracing (e.g., JVM tracing and logging), system profiling (e.g., CPU time breakdown) and hardware profiling (e.g., CPU cycle breakdown). Based on the hotspot analysis, we correlate the hotspot functions to code fragments of the workload and choose the corresponding dwarf components by analyzing the computation logic of those code fragments. Our proxy benchmark is a DAG-like combination of the selected dwarf components with initial weights set by their execution ratios.
We measure the proxy benchmark's accuracy by comparing its performance data with those of the original workload at both the system and micro-architecture levels. In our methodology, we can choose different performance metrics to measure the accuracy according to which behaviors of the workload we care about. For example, if our proxy benchmarks are to focus on cache behaviors of the workload, we can choose metrics that reflect cache behaviors, like cache hit rates, to tune a qualified proxy benchmark. To tune the accuracy, i.e., to make the proxy benchmark more similar to the original workload, we further provide an auto-tuning tool with four tunable parameters (listed in Table 2). We found that these four parameters play essential roles in the system and architectural behaviors of a workload. The tuning process is iterative and can be separated into three stages: parameter initialization, an adjusting stage and a feedback stage. We elaborate on them below.
Parameter Initialization We initialize the four parameters (Input Data Size, Chunk Size, Parallelism Degree and Weight) according to the configuration of the original workload. We scale down the input data set and chunk size of the original workload to initialize Input Data Size and Chunk Size. The Parallelism Degree is initialized to the parallelism degree of the original workload. To guarantee the rationality of the weight of each dwarf component, we initialize the weights proportionally to their corresponding execution ratios. For example, in Hadoop TeraSort, the weights are 70% for sort computation, 10% for sampling computation, and 20% for graph computation. Note that a weight can be adjusted within a reasonable range (e.g., plus or minus 10%) during the tuning process.
Adjusting Stage
Table 2: Tunable Parameters for Each Dwarf Component.

Parameter | Description
Input Data Size | The input data size for each dwarf component
Chunk Size | The data block size processed by each thread for each dwarf component
Parallelism Degree | The process and thread numbers for each dwarf component
Weight | The contribution for each dwarf component
We introduce a decision-tree-based mechanism to assist the auto-tuning process. The tool learns the impact of each parameter on all metrics and builds a decision tree through impact analysis. The learning process changes one parameter at a time and executes the workload multiple times to characterize that parameter's impact on each metric. Based on the impact analysis, the tool builds a decision tree that determines which parameter to tune when a metric has a large deviation. After that, the feedback stage is activated to evaluate the proxy benchmark with the tuned parameters. If it does not satisfy the requirements, the tool adjusts the parameters to improve the accuracy using the decision tree in the adjusting stage.
Feedback Stage In the feedback stage, the tool evaluates the accuracy of the current proxy benchmark with specific parameters. If the deviations of all metrics are within the set range (e.g., 15%), the auto-tuning process finishes. Otherwise, the metrics with large deviations are fed back to the adjusting stage. The adjusting and feedback process iterates until reaching the specified accuracy, and the finalized proxy benchmark with the final parameter settings is our qualified proxy benchmark.
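The adjusting/feedback iteration above can be summarized as a simple loop. The sketch below is ours, not the authors' tool: `run_proxy` stands in for executing the proxy benchmark and collecting metrics, and `choose_param` stands in for the decision tree built by impact analysis; the parameter names follow Table 2.

```python
# Minimal sketch of the auto-tuning loop: adjust parameters until every
# metric's deviation from the original workload falls within a threshold.

def deviation(proxy_val, original_val):
    return abs(proxy_val - original_val) / original_val

def auto_tune(params, run_proxy, original_metrics, choose_param,
              threshold=0.15, max_iters=100):
    """params: dict of tunable parameters. run_proxy(params) -> metric dict.
    choose_param(metric_name) -> (param_name, adjustment); this stands in
    for the decision tree learned through impact analysis."""
    for _ in range(max_iters):
        metrics = run_proxy(params)                    # feedback stage
        bad = {m: deviation(v, original_metrics[m])
               for m, v in metrics.items()
               if deviation(v, original_metrics[m]) > threshold}
        if not bad:
            return params, metrics                     # qualified proxy
        worst = max(bad, key=bad.get)                  # largest deviation
        name, delta = choose_param(worst)              # adjusting stage
        params[name] += delta
    raise RuntimeError("did not converge within max_iters")

# Toy usage: one metric ("ipc") that improves as "weight" approaches 0.7.
target = {"ipc": 1.0}
tuned, m = auto_tune({"weight": 0.3},
                     run_proxy=lambda p: {"ipc": 1.0 - abs(p["weight"] - 0.7)},
                     original_metrics=target,
                     choose_param=lambda metric: ("weight", 0.1))
```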
2.4 Proxy Benchmarks Implementation
Considering the mandatory requirements of simulation time and performance data accuracy in the architecture community, we implement four proxy benchmarks with respect to four representative Hadoop benchmarks
Table 3: Four Hadoop Benchmarks from BigDataBench and Their Corresponding Proxy Benchmarks.

Big Data Benchmark | Workload Patterns | Data Set | Involved Dwarfs | Involved Dwarf Components
Hadoop TeraSort | I/O Intensive | Text | Sort, Sampling, Graph computations | Quick sort; Merge sort; Random sampling; Interval sampling; Graph construction; Graph traversal
Hadoop Kmeans | CPU Intensive | Vectors | Matrix, Sort, Basic Statistic computations | Vector Euclidean distance; Cosine distance; Quick sort; Merge sort; Cluster count; Average computation
Hadoop PageRank | Hybrid | Graph | Matrix, Sort, Basic Statistic computations | Matrix construction; Matrix multiplication; Quick sort; Min/max calculation; Out-degree and in-degree count of nodes
Hadoop SIFT | CPU Intensive, Memory Intensive | Image | Matrix, Sort, Sampling, Transform, Basic Statistic computations | Matrix construction; Matrix multiplication; Quick sort; Min/max calculation; Interval sampling; FFT/IFFT transformation; Count statistics
from BigDataBench [3] (TeraSort, Kmeans, PageRank, and SIFT), according to our benchmarking methodology. At the request of our industry partners, we implemented these four proxy benchmarks first; other proxy benchmarks are being built using the same methodology. We chose them for the following reasons.
Representative Application Domains They are all widely used in many important application domains. For example, TeraSort is a widely-used workload in many application domains; PageRank is a famous workload for search engines; Kmeans is a simple but useful workload used in Internet services; SIFT is a fundamental workload for image feature extraction.
Various Workload Patterns They have different workload patterns. Hadoop TeraSort is an I/O-intensive workload; Hadoop Kmeans is a CPU-intensive workload; Hadoop PageRank is a hybrid workload that falls between CPU-intensive and I/O-intensive; Hadoop SIFT is both CPU-intensive and memory-intensive.
Diverse Data Inputs They take different data inputs. Hadoop TeraSort uses text data generated by gensort [25]; Hadoop Kmeans uses vector data, while Hadoop PageRank uses graph data; Hadoop SIFT uses image data from ImageNet [26]. These benchmarks are of great significance for measuring big data systems and architectures [18].
In the rest of this paper, we use Proxy TeraSort,
Proxy Kmeans, Proxy PageRank and Proxy SIFT to denote the proxy benchmarks for Hadoop TeraSort, Hadoop Kmeans, Hadoop PageRank and Hadoop SIFT from BigDataBench, respectively. The input data of each proxy benchmark have the same data type and distribution as those of the corresponding Hadoop benchmark, so as to preserve the data's impact on workload behaviors. To resemble the process of the Hadoop framework, we choose the OpenMP implementations of the dwarf components, which follow similar processing steps to the Hadoop framework: input data partition, chunk data allocation per thread, intermediate data output to disk, and data combination. JVM garbage collection (GC) is an important step for automatic memory management. Currently, we do not consider GC's impact in the proxy benchmark construction, as GC occupies only a small fraction of time in the Hadoop workloads (about 1% in our experiments) when the node has enough memory. Finally, we use the auto-tuning tool to generate the proxy benchmarks. The process of constructing these four proxy benchmarks shows that our auto-tuning method can reach a steady state after dozens of iterations. Table 3 lists the benchmark details from the perspectives of data set, involved dwarfs, and involved dwarf components.
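The Hadoop-like processing steps listed above (input partition, per-thread chunk allocation, intermediate output to disk, combination) can be sketched as follows. This uses Python threads purely for illustration, under the assumption that the actual dwarf components realize the same flow in OpenMP/C; all names here are ours.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(data, chunk_size, process_chunk, combine, parallelism=4):
    """Mimic the Hadoop-style flow: partition the input into chunks, process
    each chunk in a worker, spill intermediate results to disk, then read
    them back and combine."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    tmpdir = tempfile.mkdtemp()

    def worker(idx_chunk):
        idx, chunk = idx_chunk
        path = os.path.join(tmpdir, f"part-{idx}")
        with open(path, "w") as f:              # intermediate data to disk
            f.write("\n".join(map(str, process_chunk(chunk))))
        return path

    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        paths = list(pool.map(worker, enumerate(chunks)))

    partials = []
    for path in paths:                          # combination phase
        with open(path) as f:
            partials.append([int(x) for x in f.read().splitlines() if x])
    return combine(partials)

# Toy TeraSort-like use: sort each chunk, then merge-combine.
out = run_pipeline(list(range(9, -1, -1)), chunk_size=4,
                   process_chunk=sorted,
                   combine=lambda parts: sorted(sum(parts, [])))
```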
3. EVALUATION
In this section, we evaluate our proxy benchmarksfrom the perspectives of runtime speedup and accuracy.
3.1 Experiment Setups
We deploy a five-node cluster with one master node and four slave nodes, connected by a 1 Gb Ethernet network. Each node is equipped with two Intel Xeon E5645 (Westmere) processors, and each processor has six physical out-of-order cores. Each node has 32 GB of memory and runs Linux CentOS 6.4 with Linux kernel version 3.11.10. The JDK and Hadoop versions are 1.7.0 and 2.7.1, respectively. The GCC version is 4.8.0, which supports the OpenMP 3.1 specification. The proxy benchmarks are compiled with the "-O2" option for optimization. The hardware and software details are listed in Table 4.
To evaluate the performance data accuracy, we run the proxy benchmarks against the benchmarks from BigDataBench. We run the four Hadoop benchmarks from BigDataBench on the above five-node cluster using optimized Hadoop configurations, tuning the data block size of the Hadoop distributed file system, the memory allocation for each map/reduce job and the number of reduce jobs according to the cluster scale
Table 4: Node Configuration Details of Xeon E5645.

Hardware Configurations
CPU Type | Intel Xeon E5645
Intel CPU Cores | 6 cores @ 2.40 GHz
L1 DCache | 6 × 32 KB
L1 ICache | 6 × 32 KB
L2 Cache | 6 × 256 KB
L3 Cache | 12 MB
Memory | 32 GB, DDR3
Disk | SATA @ 7200 RPM
Ethernet | 1 Gb
Hyper-Threading | Disabled

Software Configurations
Operating System | CentOS 6.4
Linux Kernel | 3.11.10
JDK Version | 1.7.0
Hadoop Version | 2.7.1
and memory size. For Hadoop TeraSort, we choose 100 GB of text data produced by gensort [25]. For Hadoop Kmeans and PageRank, we choose 100 GB of sparse vector data with 90% sparsity 4 and a 2^26-vertex graph, respectively, both generated by BDGS [27]. For Hadoop SIFT, we use one hundred thousand images from ImageNet [26]. For comparison, we run each of the four proxy benchmarks on one of the slave nodes.
3.2 Metrics Selection and Collection
To evaluate accuracy, we choose micro-architectural and system metrics covering instruction mix, cache behavior, branch prediction, processor performance, memory bandwidth and disk I/O behavior. Table 5 presents the metrics we choose.
Processor Performance. We choose two metrics to measure overall processor performance. Instructions per cycle (IPC) indicates the average number of instructions executed per clock cycle. Million instructions per second (MIPS) indicates the instruction execution speed.
Instruction Mix. We consider the instruction mix
breakdown, including the percentages of integer instructions, floating-point instructions, load instructions, store instructions and branch instructions.
Branch Prediction. Branch prediction is an important strategy in modern processors. We track the misprediction ratio of branch instructions (br miss for short).
Cache Behavior. We evaluate cache efficiency using cache hit ratios, including the L1 instruction cache, L1 data cache, L2 cache and L3 cache hit ratios.
Memory Bandwidth. We measure the data load rate from memory and the data store rate into memory, in bytes per second. We choose memory read bandwidth (read bw for short), memory write bandwidth (write bw for short) and total memory bandwidth including both read and write (mem bw for short).
Disk I/O Behavior. We employ I/O bandwidth to
4The sparsity of the vector indicates the proportion of zero-valued elements.
reflect the I/O behaviors of workloads.
We collect micro-architectural metrics from hardware performance monitoring counters (PMCs), and look up the hardware events' values in the Intel Developer's Manual [28]. Perf [29] is used to collect these hardware events. To guarantee accuracy and validity, we run each workload three times, and collect the performance data of the workloads on all slave nodes during the whole runtime. We report and analyze the average values.
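Given raw counter values from perf, the two processor performance metrics follow directly from their definitions (a sketch of the arithmetic; run time is assumed to be in seconds):

```python
def ipc(instructions, cycles):
    """Instructions per cycle: retired instructions / CPU cycles."""
    return instructions / cycles

def mips(instructions, runtime_seconds):
    """Million instructions per second."""
    return instructions / runtime_seconds / 1e6

# Example: 2e9 instructions retired over 1.6e9 cycles in 2 seconds.
print(ipc(2e9, 1.6e9))   # 1.25
print(mips(2e9, 2.0))    # 1000.0
```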
Table 5: System and Micro-architectural Metrics.

Micro-architectural Metrics
Category | Metric Name | Description
Processor Performance | IPC | Instructions per cycle
Processor Performance | MIPS | Million instructions per second
Instruction Mix | Instruction ratios | Ratios of floating-point, load, store, branch and integer instructions
Branch Prediction | Branch Miss | Branch misprediction ratio
Cache Behavior | L1I Hit Ratio | L1 instruction cache hit ratio
Cache Behavior | L1D Hit Ratio | L1 data cache hit ratio
Cache Behavior | L2 Hit Ratio | L2 cache hit ratio
Cache Behavior | L3 Hit Ratio | L3 cache hit ratio

System Metrics
Category | Metric Name | Description
Memory Bandwidth | Read Bandwidth | Memory load bandwidth
Memory Bandwidth | Write Bandwidth | Memory store bandwidth
Memory Bandwidth | Total Bandwidth | Memory load and store bandwidth
Disk I/O Behavior | Disk I/O Bandwidth | Disk read and write bandwidth
3.3 Runtime Speedup
Table 6 presents the execution time of the Hadoop benchmarks and the proxy benchmarks on Xeon E5645. Hadoop TeraSort with 100 GB of text data runs 1500 seconds on the five-node cluster. Hadoop Kmeans with 100 GB of vectors runs 5971 seconds per iteration. Hadoop PageRank with the 2^26-vertex graph runs 1444 seconds per iteration. Hadoop SIFT with one hundred thousand images runs 721 seconds. The four corresponding proxy benchmarks each run about ten seconds on a single physical machine. For TeraSort, Kmeans, PageRank and SIFT, the speedup is 136X (1500/11.02), 743X (5971/8.03), 160X (1444/9.03) and 90X (721/8.02), respectively.
3.4 Accuracy
We evaluate the accuracy of all metrics listed in Table 5. For each metric, the accuracy of the proxy benchmark compared to the Hadoop benchmark is computed by Equation 1, where Val_H represents the average value for the Hadoop benchmark on all slave nodes and Val_P represents the average value for the proxy benchmark on a slave node. The resulting value ranges from 0
Table 6: Execution Time on Xeon E5645.

Workload | Hadoop version (s) | Proxy version (s)
TeraSort | 1500 | 11.02
Kmeans | 5971 | 8.03
PageRank | 1444 | 9.03
SIFT | 721 | 8.02
to 1; a value closer to 1 indicates higher accuracy.
Accuracy(Val_H, Val_P) = 1 − |(Val_P − Val_H) / Val_H|    (1)
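Equation 1 translates directly into code (a one-line sketch of the accuracy computation):

```python
def accuracy(val_h, val_p):
    """Accuracy per Equation 1: 1 - |(Val_P - Val_H) / Val_H|.
    val_h: average value of the Hadoop benchmark across slave nodes;
    val_p: average value of the proxy benchmark on a slave node.
    Values closer to 1 mean the proxy tracks the original more closely."""
    return 1 - abs((val_p - val_h) / val_h)

# If the Hadoop benchmark's IPC is 1.0 and the proxy's is 0.94,
# the accuracy for that metric is 0.94 (i.e., 94%).
print(accuracy(1.0, 0.94))
```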
Fig. 5 presents the system and micro-architectural data accuracy of the proxy benchmarks on Xeon E5645. The average accuracy of all metrics is greater than 90%: for TeraSort, Kmeans, PageRank and SIFT, the average accuracy is 94%, 91%, 93% and 94%, respectively. Fig. 6 shows the instruction mix breakdown of the proxy benchmarks and Hadoop benchmarks. From Fig. 6, we can see that the four proxy benchmarks preserve the instruction mix characteristics of the four Hadoop benchmarks. For example, integer instructions make up 44% for Hadoop TeraSort and 46% for Proxy TeraSort, while floating-point instructions make up less than 1% for both. For instructions involving data movement, Hadoop TeraSort contains 39% load and store instructions, and Proxy TeraSort contains 37%. The SIFT workload, widely used in computer vision for image processing, has many floating-point instructions; Proxy SIFT likewise preserves the instruction mix characteristics of Hadoop SIFT.
Figure 5: System and Micro-architectural Data Accu-racy on Xeon E5645.
3.4.1 Disk I/O Behaviors
Big data applications exert significant disk I/O pressure. To evaluate the disk I/O behaviors of the proxy benchmarks, we compute the disk I/O bandwidth using Equation 2, where Sector_{Read+Write} is the total number of sector reads and sector writes, and Size_Sector is the sector size (512 bytes on our nodes).
Figure 6: Instruction Mix Breakdown on Xeon E5645.
BW_DiskI/O = (Sector_{Read+Write} * Size_Sector) / RunTime    (2)
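Equation 2 amounts to converting sector counts into bytes moved per second; a minimal sketch (the MB/s conversion is our choice of unit, and on Linux the sector counts could be read from /proc/diskstats before and after a run):

```python
SECTOR_SIZE = 512  # bytes per sector on our nodes

def disk_io_bandwidth(sectors_read: int, sectors_written: int,
                      runtime_s: float) -> float:
    """Equation 2: average disk I/O bandwidth over a run, in MB/s.
    Sector counts are the totals accumulated during the run."""
    total_bytes = (sectors_read + sectors_written) * SECTOR_SIZE
    return total_bytes / runtime_s / (1024 * 1024)
```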
Fig. 7 presents the disk I/O bandwidth of the proxy benchmarks and Hadoop benchmarks on Xeon E5645. We find that they exert similar average disk I/O pressure. For example, the disk I/O bandwidth of Proxy TeraSort and Hadoop TeraSort is 32.04 MB/s and 33.99 MB/s, respectively.
Figure 7: Disk I/O Bandwidth on Xeon E5645.
3.4.2 The Impact of Input Data
In this section, we demonstrate that when we change the input data sparsity, our proxy benchmarks can still mimic the big data workloads with high accuracy. We run Hadoop Kmeans with the same Hadoop configurations and a 100 GB data input, using two different inputs: sparse vectors (the original configuration, with 90% of elements zero) and dense vectors (all elements non-zero). Fig. 8 presents the performance difference between the two inputs. We find that the memory bandwidth with sparse vectors is nearly half of that with dense vectors, which confirms the input data's impact on micro-architectural performance.

On the other hand, Fig. 9 presents the accuracy of the proxy benchmark using the two inputs. The average micro-architectural data accuracy of Proxy Kmeans is above 91% with respect to the fully-distributed Hadoop Kmeans using dense input data with no zero-valued elements. When we change the input data sparsity from 0% to 90%, the data accuracy of Proxy Kmeans remains above 91% with respect to the original workload. Hence, Proxy Kmeans can mimic Hadoop Kmeans under different input data.
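The sparse and dense inputs differ only in the fraction of zero-valued elements; a minimal sketch of how such vectors could be generated (the dimension and value range are our illustrative choices, not the paper's generator):

```python
import random

def make_vector(dim: int, sparsity: float) -> list:
    """Generate one input vector; `sparsity` is the fraction of
    zero-valued elements (0.9 for the sparse input, 0.0 for dense)."""
    return [0.0 if random.random() < sparsity else random.uniform(0.1, 1.0)
            for _ in range(dim)]

sparse = make_vector(1000, 0.9)  # ~90% zeros, as in the original configuration
dense = make_vector(1000, 0.0)   # no zeros
```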
Figure 8: Data Impact on Memory Bandwidth UsingSparse and Dense Data for Hadoop Kmeans on XeonE5645.
Figure 9: System and Micro-architectural Data Accu-racy Using Different Data Input on Xeon E5645.
4. CASE STUDIES ON ARM PROCESSORS
This section demonstrates that our proxy benchmarks can also mimic the original benchmarks on ARMv8 processors. We report our joint evaluation work with our industry partner on ARMv8 processors using Hadoop TeraSort, Kmeans, and PageRank, and the corresponding proxy benchmarks. Our evaluation covers widely accepted metrics: runtime speedup and performance accuracy, plus further concerns such as multi-core scalability and cross-platform speedup.
4.1 Experiment Setup
Due to the limited availability of ARMv8 processors, we use a two-node (one master and one slave) cluster with each node equipped with one ARMv8 processor. In addition, we deploy a two-node (one master and one slave) cluster with each node equipped with one Xeon E5-2690 v3 (Haswell) processor for speedup comparison. Each ARMv8 processor has 32 physical cores; each core has independent L1 instruction and L1 data caches, every four cores share an L2 cache, and all cores share the last-level cache. Each Haswell processor has 12 physical cores; each core has independent L1 and L2 caches, and all cores share the last-level cache. The memory of each node is 64 GB on both platforms. In order to narrow the gap in logical core counts between the two architectures, we enable hyper-threading on the Haswell processor. Table 7 lists the hardware and software details of the two platforms.

Table 7: Platform Configurations

Hardware Configurations
Model                 ARMv8                       Xeon E5-2690 V3
Number of Processors  1                           1
Number of Cores       32                          12
Frequency             2.1GHz                      2.6GHz
L1 Cache (I/D)        48KB/32KB                   32KB/32KB
L2 Cache              8 x 1024KB                  12 x 256KB
L3 Cache              32MB                        30MB
Architecture          ARM                         X86_64
Memory                64GB, DDR4                  64GB, DDR4
Ethernet              1Gb                         1Gb
Hyper-Threading       None                        Enabled

Software Configurations
Operating System      EulerOS V2.0                Red Hat Enterprise Linux Server release 7.0
Linux Kernel          4.1.23-vhulk3.6.3.aarch64   3.10.0-123.el7.x86_64
GCC Version           4.9.3                       4.8.2
JDK Version           jdk1.8.0_101                jdk1.7.0_79
Hadoop Version        2.5.2                       2.5.2

Considering the memory size of the cluster, we use 50 GB of text data generated by gensort for Hadoop TeraSort, 50 GB of dense vectors for Hadoop Kmeans, and a 2^24-vertex graph for Hadoop PageRank. We run the Hadoop benchmarks with optimized configurations, tuning the data block size of the Hadoop distributed file system, the memory allocation for each job, and the number of reduce tasks according to the cluster scale and memory size. For comparison, we run the proxy benchmarks on the slave node. Our industry partner pays close attention to cache and memory access patterns, which are important micro-architectural and system metrics for chip design, so we mainly collect cache-related and memory-related performance data.
4.2 Runtime Speedup on ARMv8
Table 8 presents the execution time of the Hadoop benchmarks and the proxy benchmarks on ARMv8. Our proxy benchmarks run within 10 seconds on the ARMv8 processor. On the two-node cluster equipped with ARMv8 processors, Hadoop TeraSort with 50 GB of text data runs 1378 seconds. Hadoop Kmeans with 50 GB of vectors runs 3347 seconds for one iteration. Hadoop PageRank with a 2^24-vertex graph runs 4291 seconds for five iterations. In contrast, their proxy benchmarks run 4102, 8677, and 6219 milliseconds, respectively. For TeraSort, Kmeans, and PageRank, the speedup is 336X (1378/4.10), 386X (3347/8.68), and 690X (4291/6.22), respectively.
Table 8: Execution Time on ARMv8.

Workloads   Execution Time (Second)
            Hadoop version   Proxy version
TeraSort    1378             4.10
Kmeans      3347             8.68
PageRank    4291             6.22
4.3 Accuracy on ARMv8
We report the system and micro-architectural data accuracy of the Hadoop benchmarks and the proxy benchmarks. Likewise, we evaluate the accuracy by Equation 1. Fig. 10 presents the accuracy of the proxy benchmarks on the ARMv8 processor. We find that on ARMv8 the average data accuracy is above 90% for all workloads. For TeraSort, Kmeans, and PageRank, the average accuracy is 93%, 95%, and 92%, respectively.
Figure 10: System and Micro-architectural Data Accu-racy on ARMv8.
4.4 Multi-core Scalability on ARMv8
The ARMv8 processor has 32 physical cores, and we evaluate its multi-core scalability using the Hadoop benchmarks and the proxy benchmarks on 4, 8, 16, and 32 cores, respectively. For each experiment, we disable the specified number of CPU cores through the CPU hotplug mechanism. For the Hadoop benchmarks, we adjust the Hadoop configurations so as to obtain peak performance. For the proxy benchmarks, we run them directly without any modification.

Fig. 11 reports the multi-core scalability in terms of runtime and MIPS. The horizontal axis represents the number of cores and the vertical axis represents runtime or MIPS. Due to the large runtime gap between the Hadoop benchmarks and the proxy benchmarks, we plot their runtimes on different sides of the vertical axis: the left side indicates the runtime of the Hadoop benchmarks, while the right side indicates the runtime of the proxy benchmarks. We find that they have similar multi-core scalability trends in terms of both runtime and instruction execution speed.
4.5 Runtime Speedup across Different Processors
Runtime speedup across different processors is another metric of interest to our industry partner. In order to reflect the impact of different design decisions when running big data analytics workloads on X86 and ARM architectures, the proxy benchmarks should maintain consistent performance trends. That is to say, if the proxy benchmarks gain a performance improvement from a design change, the real workloads should also benefit from that change. We evaluate the runtime speedup across the two architectures: ARMv8 and Xeon E5-2690 V3 (Haswell). The runtime speedup is computed using Equation 3. The Hadoop configurations are again optimized for each hardware environment, while the proxy benchmarks use the same version on both architectures.
Speedup(Time_X86_64, Time_ARM) = Time_ARM / Time_X86_64    (3)
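Equation 3 is the plain ratio of the ARM runtime to the X86_64 runtime; a minimal sketch using the TeraSort runtimes reported in this section:

```python
def cross_platform_speedup(time_arm_s: float, time_x86_s: float) -> float:
    """Equation 3: runtime speedup of X86_64 (Haswell) over ARMv8."""
    return time_arm_s / time_x86_s

# TeraSort runtimes from this section (seconds).
hadoop = cross_platform_speedup(1378, 856)    # Hadoop TeraSort
proxy = cross_platform_speedup(4.10, 2.56)    # Proxy TeraSort
print(f"Hadoop: {hadoop:.2f}, Proxy: {proxy:.2f}")
```

Consistent speedup trends mean these two ratios stay close, as the results below show.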
Fig. 12 shows the runtime speedups of the Hadoop benchmarks and the proxy benchmarks across the ARM and X86_64 architectures. We find that they have consistent speedup trends. For example, Hadoop TeraSort runs 1378 seconds on ARMv8 and 856 seconds on Haswell, while Proxy TeraSort runs 4.1 seconds and 2.56 seconds, respectively. The runtime speedups between ARMv8 and Haswell are 1.61 (1378/856) for Hadoop TeraSort and 1.60 (4.1/2.56) for Proxy TeraSort.
5. RELATED WORK
Multiple benchmarking methodologies have been proposed over the past few decades. The simplest one is to create a new benchmark for every possible workload. PARSEC [2] provides a series of shared-memory programs for chip multiprocessors. BigDataBench [3] is a benchmark suite providing dozens of big data workloads. CloudSuite [4] consists of eight applications selected based on popularity. These benchmarking methods need to provide individual implementations for every possible workload and must keep expanding the benchmark set to cover emerging workloads. Moreover, it is frustrating to run (component or application) benchmarks like BigDataBench or CloudSuite on simulators because of their complex software stacks and long running times. Using reduced input data is one way to shorten execution time. Previous work [30, 31] adopts reduced data sets for the SPEC benchmarks while maintaining architecture behaviors similar to those of the full reference data sets.

Kernel benchmarks are widely used in high-performance computing. The Livermore kernels [32] use Fortran applications to measure the floating-point performance range. The NAS parallel benchmarks [7] consist of several separate tests, including five kernels and three pseudo-applications derived from computational fluid dynamics (CFD) applications. Linpack [33] provides a collection of Fortran subroutines. Kernel benchmarks are insufficient to completely reflect workload behaviors, considering the complexity and diversity of big data workloads [7, 5].

In terms of micro-architectural simulation, many previous studies generated synthetic benchmarks as proxies [34, 35]. Statistical simulation [12, 13, 14, 36, 37, 38] generates synthetic traces or synthetic benchmarks to mimic the micro-architectural performance of long-running real workloads; it targets one workload on a specific architecture with a certain configuration, so each benchmark needs to be regenerated for other architectures or configurations [39]. Sampled simulation selects a series of sample units for simulation instead of the entire instruction stream, sampled randomly [8], periodically [9, 10], or based on phase behavior [11]. Seongbeom et al. [40] accelerated full-system simulation through characterizing and
(a) TeraSort (b) Kmeans (c) PageRank
Figure 11: Multi-core Scalability of the Hadoop benchmarks and Proxy Benchmarks on ARMv8.
Figure 12: Runtime Speedup across Different Proces-sors.
predicting the performance behavior of OS services. For emerging big data workloads, PerfProx [41] proposed a proxy benchmark generation framework for real-world database applications through characterizing low-level dynamic execution characteristics.

Our big data dwarfs are inspired by previous successful abstractions in other application scenarios. The set concept in relational algebra [42] abstracted five primitive and fundamental operators (Select, Project, Product, Union, Difference), setting off a wave of relational database research. The set abstraction is the basis of relational algebra and the theoretical foundation of databases. Phil Colella [15] identified seven dwarfs of numerical methods that he thought would be important for the next decade. Based on that, a multidisciplinary group of Berkeley researchers proposed 13 dwarfs as high-level abstractions of parallel computing, capturing the computation and communication patterns of a great mass of applications [17]. The National Research Council proposed seven major tasks in massive data analysis [43], which they called giants. These seven giants are macroscopic definitions of problems from the perspective of mathematics, whereas our eight classes of dwarfs are frequently-appearing units of computation in the above tasks and problems.
6. CONCLUSIONS
In this paper, we propose a dwarf-based scalable big data benchmarking methodology. After thoroughly analyzing a majority of algorithms in five typical application domains (search engine, social networks, e-commerce, multimedia, and bioinformatics), we capture eight big data dwarfs among a wide variety of big data analytics workloads: matrix, sampling, logic, transform, set, graph, sort, and basic statistic computations. For each dwarf, we implement the dwarf components using diverse software stacks. We construct and tune the big data proxy benchmarks using DAG-like combinations of the dwarf components with different weights to mimic the benchmarks in BigDataBench.

Our proxy benchmarks achieve 100s-fold runtime speedup with respect to the benchmarks from BigDataBench, while the average micro-architectural data accuracy remains above 90% on both X86_64 and ARM architectures. The proxy benchmarks have been applied to ARM processor design by our industry partner.
7. REFERENCES
[1] J. Gray, "Database and transaction processing performance handbook," 1993.
[2] C. Bienia, Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January 2011.
[3] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, "BigDataBench: A big data benchmark suite from internet services," in IEEE International Symposium on High Performance Computer Architecture (HPCA), 2014.
[4] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, "Clearing the clouds: A study of emerging workloads on modern hardware," in ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2012.
[5] D. J. Lilja, Measuring Computer Performance: A Practitioner's Guide. Cambridge University Press, 2005.
[6] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Elsevier, 2011.
[7] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga, "The NAS parallel benchmarks," The International Journal of Supercomputing Applications, vol. 5, no. 3, pp. 63-73, 1991.
[8] T. M. Conte, M. A. Hirsch, and K. N. Menezes, "Reducing state loss for effective trace sampling of superscalar processors," in IEEE International Conference on Computer Design (ICCD), 1996.
[9] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe, "SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling," in IEEE International Symposium on Computer Architecture (ISCA), 2003.
[10] Z. Yu, H. Jin, J. Chen, and L. K. John, "TSS: Applying two-stage sampling in micro-architecture simulations," in IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2009.
[11] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically characterizing large scale program behavior," in ACM SIGARCH Computer Architecture News, vol. 30, pp. 45-57, 2002.
[12] K. Skadron, M. Martonosi, D. August, M. Hill, D. Lilja, and V. S. Pai, "Challenges in computer architecture evaluation," IEEE Computer, vol. 36, no. 8, pp. 30-36, 2003.
[13] L. Eeckhout, R. H. Bell Jr, B. Stougie, K. De Bosschere, and L. K. John, "Control flow modeling in statistical simulation for accurate and efficient processor design studies," ACM SIGARCH Computer Architecture News, vol. 32, no. 2, p. 350, 2004.
[14] L. Eeckhout, K. De Bosschere, and H. Neefs, "Performance analysis through synthetic trace generation," in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2000.
[15] P. Colella, "Defining software requirements for scientific computing," 2004.
[16] http://www.krellinst.org/doecsgf/conf/2014/pres/jhill.pdf.
[17] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, "The landscape of parallel computing research: A view from Berkeley," Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.
[18] Z. Jia, J. Zhan, L. Wang, R. Han, S. A. McKee, Q. Yang, C. Luo, and J. Li, "Characterizing and subsetting big data workloads," in IEEE International Symposium on Workload Characterization (IISWC), 2014.
[19] J. L. Abellan, C. Chen, and A. Joshi, "Electro-photonic NoC designs for kilocore systems," ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 13, no. 2, p. 24, 2016.
[20] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1-7, 2011.
[21] L. Dagum and R. Menon, "OpenMP: An industry standard API for shared-memory programming," IEEE Computational Science and Engineering, vol. 5, no. 1, pp. 46-55, 1998.
[22] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A high-performance, portable implementation of the MPI message passing interface standard," Parallel Computing, vol. 22, no. 6, pp. 789-828, 1996.
[23] "Hadoop." http://hadoop.apache.org/.
[24] "Spark." https://spark.apache.org/.
[25] "Gensort." http://www.ordinal.com/gensort.html.
[26] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[27] Z. Ming, C. Luo, W. Gao, R. Han, Q. Yang, L. Wang, and J. Zhan, "BDGS: A scalable big data generator suite in big data benchmarking," arXiv preprint arXiv:1401.5465, 2014.
[28] Intel, "Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1," 2010.
[29] "Perf tool." https://perf.wiki.kernel.org/index.php/Main_Page.
[30] A. J. KleinOsowski, J. Flynn, N. Meares, and D. J. Lilja, Adapting the SPEC 2000 Benchmark Suite for Simulation-Based Computer Architecture Research, pp. 83-100. Boston, MA: Springer US, 2001.
[31] A. KleinOsowski and D. J. Lilja, "MinneSPEC: A new SPEC benchmark workload for simulation-based computer architecture research," IEEE Computer Architecture Letters, vol. 1, no. 1, pp. 7-7, 2002.
[32] F. H. McMahon, "The Livermore Fortran kernels: A computer test of the numerical performance range," tech. rep., Lawrence Livermore National Lab., CA (USA), 1986.
[33] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: Past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803-820, 2003.
[34] R. H. Bell Jr and L. K. John, "Improved automatic testcase synthesis for performance model validation," in ACM International Conference on Supercomputing (ICS), 2005.
[35] K. Ganesan, J. Jo, and L. K. John, "Synthesizing memory-level parallelism aware miniature clones for SPEC CPU2006 and ImplantBench workloads," in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2010.
[36] S. Nussbaum and J. E. Smith, "Modeling superscalar processors via statistical simulation," in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2001.
[37] M. Oskin, F. T. Chong, and M. Farrens, HLS: Combining Statistical and Symbolic Simulation to Guide Microprocessor Designs, vol. 28. ACM, 2000.
[38] L. Eeckhout and K. De Bosschere, "Early design phase power/performance modeling through statistical simulation," in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2001.
[39] A. M. Joshi, Constructing Adaptable and Scalable Synthetic Benchmarks for Microprocessor Performance Evaluation. ProQuest, 2007.
[40] S. Kim, F. Liu, Y. Solihin, R. Iyer, L. Zhao, and W. Cohen, "Accelerating full-system simulation through characterizing and predicting operating system performance," in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2007.
[41] R. Panda and L. K. John, "Proxy benchmarks for emerging big-data workloads," in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2017.
[42] E. F. Codd, "A relational model of data for large shared data banks," Communications of the ACM, vol. 13, no. 6, pp. 377-387, 1970.
[43] National Research Council, "Frontiers in massive data analysis," The National Academies Press, Washington, DC, 2013.