Institute of Computing Technology
Benchmarking Datacenter and Big Data Systems
Wanling Gao, Zhen Jia, Lei Wang, Yuqing Zhu, Chunjie Luo, Yingjie Shi, Yongqiang He, Shiming Gong, Xiaona Li, Shujie Zhang, Bizhu Qiu, Lixin Zhang, Jianfeng Zhan
http://prof.ict.ac.cn/jfzhan
Big Data Benchmarking Workshop
Acknowledgements
This work is supported by the Chinese 973 project (Grant No. 2011CB302502), the Hi-Tech Research and Development (863) Program of China (Grant No. 2011AA01A203 and No. 2013AA01A213), the NSFC project (Grant No. 60933003 and No. 61202075), the BNSF project (Grant No. 4133081), and Huawei funding.
Executive summary
• ICTBench: an open-source project on datacenter and big data benchmarking (http://prof.ict.ac.cn/ICTBench)
• Several case studies using ICTBench
Question One
• Gap between industry and academia: the distance keeps growing
  • Code
  • Data sets
Question Two
• Different benchmark requirements
  • Architecture communities
    • Simulation is very slow
    • Small data and code sets
  • System communities
    • Large-scale deployment is valuable
  • Users need real-world applications
    • "There are three kinds of lies: lies, damn lies, and benchmarks"
State-of-Practice Benchmark Suites
• SPEC CPU, SPEC Web, HPCC, PARSEC
• TPC-C, YCSB, Gridmix
Why a New Benchmark Suite for Datacenter Computing
• No benchmark suite covers the diversity of data center workloads
• State of the art: CloudSuite
  • Only includes six applications, selected according to their popularity
Why a New Benchmark Suite (Cont'd)
• Memory Level Parallelism (MLP): simultaneously outstanding cache misses
[Figure: MLP of CloudSuite compared with our benchmark suite, DCBench]
Why a New Benchmark Suite (Cont'd)
• Scale-out performance
[Figure: speedup (y-axis, 1 to 6) versus number of working nodes (x-axis: 1, 4, 8). Series: sort, grep, wordcount, svm, kmeans, fkmeans, all-pairs, Bayes, and HMM from DCBench, and the CloudSuite data analysis benchmark]
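The scale-out figure plots speedup against the number of working nodes. As a minimal sketch of how such a curve is usually derived, assuming speedup is defined as the runtime on the smallest node count divided by the runtime on n nodes (the slides do not state the definition), and using made-up runtimes rather than measurements from the talk:

```python
def speedup(runtimes_by_nodes):
    """Return {nodes: T(base) / T(nodes)} relative to the smallest node count."""
    base_nodes = min(runtimes_by_nodes)
    base_time = runtimes_by_nodes[base_nodes]
    return {n: base_time / t for n, t in sorted(runtimes_by_nodes.items())}


# Hypothetical wall-clock runtimes in seconds on 1, 4, and 8 working nodes.
sort_runtimes = {1: 1200.0, 4: 400.0, 8: 240.0}

for nodes, s in speedup(sort_runtimes).items():
    print(f"{nodes} node(s): speedup {s:.2f}")
```

A workload that scales out well keeps its speedup close to the node count; a flat curve indicates that adding nodes does not help.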
Outline
• Background and Motivation
• Our ICTBench
• Case studies
ICTBench Project
• ICTBench: three benchmark suites
  • DCBench: architecture (application, OS, and VM execution)
  • BigDataBench: system (large-scale big data applications)
  • CloudRank: cloud benchmarks (distributed management), not covered in this talk
• Project homepage: http://prof.ict.ac.cn/ICTBench
• The source code is available
DCBench
• DCBench: typical data center workloads
  • Different from scientific computing, which is measured in FLOPS
  • Covers applications in important domains
    • Search engine, electronic commerce, etc.
  • Each benchmark = a single application
• Purposes
  • Architecture and small-to-medium-scale system research
BigDataBench
• Characterizing big data applications
  • Not including data-intensive supercomputing
  • Synthetic data sets ranging from 10 GB to PB scale
  • Each benchmark = a single big application
• Purposes
  • Large-scale system and architecture research
CloudRank
• Cloud computing
  • Elastic resource management
  • Consolidating different workloads
• Cloud benchmarks
  • Each benchmark = a group of consolidated data center workloads
    • Services / data processing / desktop
• Purposes
  • Capacity planning, system evaluation, and research
  • Users can customize their benchmarks
Benchmarking Methodology
• Decide on and rank the main application domains according to a publicly available metric, e.g. page views and daily visitors
• Single out the main applications from the main application domains
Top Sites on the Web
• More details at http://www.alexa.com/topsites/global;0
[Pie chart: Top Sites on the Web. Legend: Search Engine, Social Network, Electronic Commerce, Media Streaming, Others. Slice labels: 40%, 25%, 15%, 5%, 15%]
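The first methodology step, ranking application domains by a public metric, can be sketched as follows. The percentages are the slice labels of the "Top Sites on the Web" pie chart; pairing them with the legend entries in order of appearance is an assumption, as are the function name and the choice to keep the top three named domains.

```python
# Shares read off the pie chart. Matching each label to each percentage by
# order of appearance in the transcript is an assumption.
domain_share = {
    "Search Engine": 40,
    "Social Network": 25,
    "Electronic Commerce": 15,
    "Media Streaming": 5,
    "Others": 15,
}


def rank_domains(share, top_n=3, exclude=("Others",)):
    """Rank the named domains by share, highest first, and keep the top_n."""
    named = [(d, s) for d, s in share.items() if d not in exclude]
    named.sort(key=lambda item: item[1], reverse=True)
    return named[:top_n]


main_domains = rank_domains(domain_share)
for domain, pct in main_domains:
    print(f"{domain}: {pct}%")
```

The second step, singling out the main applications within each selected domain, is what the following slides do for search engines, social networks, and electronic commerce.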
Main Algorithms in Search Engines
[Pie chart: Top Sites on the Web. Legend: Search Engine, Social Network, Electronic Commerce, Media Streaming, Others. Slice labels: 40%, 25%, 15%, 5%, 15%]
Algorithms used in search:
• PageRank
• Graph mining
• Segmentation
• Feature reduction
• Grep
• Statistical counting
• Vector calculation
• Sort
• Recommendation
• ……
Main Algorithms in Search Engines (Nutch)
[Diagram labels: Word Grep, Word Count, Segmentation, Sort, Classification, Decision Tree, BFS, Segmentation, Scoring & Sort, Merge Sort, Vector calculate, PageRank]
Main Algorithms in Social Networks
[Pie chart: Top Sites on the Web. Legend: Search Engine, Social Network, Electronic Commerce, Media Streaming, Others. Slice labels: 40%, 25%, 15%, 5%, 15%]
Algorithms used in social networks:
• Recommendation
• Clustering
• Classification
• Graph mining
• Grep
• Feature reduction
• Statistical counting
• Vector calculation
• Sort
• ……
Main Algorithms in Electronic Commerce
[Pie chart: Top Sites on the Web. Legend: Search Engine, Social Network, Electronic Commerce, Media Streaming, Others. Slice labels: 40%, 25%, 15%, 5%, 15%]
Algorithms used in electronic commerce:
• Recommendation
• Association rule mining
• Warehouse operation
• Clustering
• Classification
• Statistical counting
• Vector calculation
• ……
Overview of DCBench
Category | Workload | Programming model | Language | Source
Basic operation | Sort | MapReduce | Java | Hadoop
Basic operation | Wordcount | MapReduce | Java | Hadoop
Basic operation | Grep | MapReduce | Java | Hadoop
Classification | Naïve Bayes | MapReduce | Java | Mahout
Classification | Support Vector Machine | MapReduce | Java | Implemented by ourselves
Cluster | K-means | MapReduce | Java | Mahout
Cluster | K-means | MPI | C++ | IBM PML
Cluster | Fuzzy k-means | MapReduce | Java | Mahout
Cluster | Fuzzy k-means | MPI | C++ | IBM PML
Recommendation | Item-based collaborative filtering | MapReduce | Java | Mahout
Association rule mining | Frequent pattern growth | MapReduce | Java | Mahout
Segmentation | Hidden Markov model | MapReduce | Java | Implemented by ourselves
Overview of DCBench (Cont'd)
Category | Workload | Programming model | Language | Source
Warehouse operation | Database operations | MapReduce | Java | Hive-bench
Feature reduction | Principal Component Analysis | MPI | C++ | IBM PML
Feature reduction | Kernel Principal Component Analysis | MPI | C++ | IBM PML
Vector calculate | Paper similarity analysis | All-Pairs | C & C++ | Implemented by ourselves
Graph mining | Breadth-first search | MPI | C++ | Graph500
Graph mining | PageRank | MapReduce | Java | Mahout
Service | Search engine | C/S | Java | Nutch
Service | Auction | C/S | Java | RUBiS
Service | Media streaming | C/S | Java | CloudSuite
Methodology of Generating Big Data
• Preserve the characteristics of real-world data
[Diagram: small-scale data → characteristic analysis → expand → big data]
• Characteristics analyzed
  • Semantics, e.g. word frequency
  • Word reuse distance
  • Word distribution in documents
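To make the analyze-then-expand idea concrete, here is a minimal sketch that preserves only one of the listed characteristics, word frequency. It is not the generator shipped with BigDataBench; the function name, the seed text, and the sampling approach are assumptions for illustration, and word reuse distance and per-document word distribution are not modeled.

```python
import random
from collections import Counter


def expand_corpus(seed_text, target_words, rng_seed=0):
    """Generate target_words words whose frequencies follow the seed text."""
    freq = Counter(seed_text.split())          # characteristic analysis
    vocab = list(freq)
    weights = [freq[w] for w in vocab]
    rng = random.Random(rng_seed)              # fixed seed for repeatability
    return rng.choices(vocab, weights=weights, k=target_words)  # expand


# Hypothetical small-scale seed data.
seed = "big data big systems data big benchmark"
generated = expand_corpus(seed, target_words=10_000)

seed_freq = Counter(seed.split())
gen_freq = Counter(generated)
for word in seed_freq:
    seed_share = seed_freq[word] / len(seed.split())
    gen_share = gen_freq[word] / len(generated)
    print(f"{word}: seed {seed_share:.3f}, generated {gen_share:.3f}")
```

The generated text is much larger than the seed, while each word's share stays close to its share in the seed.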
Workloads in BigDataBench 1.0 Beta
• Analysis workloads
  • Simple but representative operations
    • Sort, Grep, Wordcount
  • Highly recognized algorithms
    • Naïve Bayes, SVM
• Search engine service workloads
  • Widely deployed services
    • Nutch Server
A Variety of Workloads Is Included
• Workloads
  • Off-line
    • Base operations
      • I/O bound: Sort
      • CPU bound: Wordcount
      • Hybrid: Grep
    • Machine learning
      • Naïve Bayes
      • SVM
  • On-line
    • Nutch Server
Features of Workloads
Workload | Resource characteristic | Computing complexity | Instructions
Sort | I/O bound | O(n*lg n) | Integer comparison domination
Wordcount | CPU bound | O(n) | Integer comparison and calculation domination
Grep | Hybrid | O(n) | Integer comparison domination
Naïve Bayes | / | O(m*n) [m: the length of the dictionary] | Floating-point computation domination
SVM | / | O(M*n) [M: the number of support vectors * dimension] | Floating-point computation domination
Nutch Server | I/O & CPU bound | | Integer comparison domination
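The three base operations and their complexities can be illustrated with single-process stand-ins. In the suite these workloads run as Hadoop MapReduce jobs written in Java; the Python functions below are only a sketch of what each operation computes, with hypothetical names and input.

```python
import re
from collections import Counter


def sort_records(lines):
    """Sort: comparison sort over n records, O(n lg n) comparisons."""
    return sorted(lines)


def grep(lines, pattern):
    """Grep: a single pass that keeps the matching lines, O(n)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def wordcount(lines):
    """Wordcount: a single pass that tallies every word, O(n)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Hypothetical input records.
lines = ["data center", "big data", "benchmark"]
print(sort_records(lines))
print(grep(lines, r"data"))
print(wordcount(lines))
```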
Content
• Background and Motivation
• Our ICTBench
• Case studies
Use Case 1: Microarchitecture Characterization
• Using DCBench
• Five-node cluster
  • One master and four slaves (working nodes)
• Each node:
Instructions Execution Level
• DCBench:
  • Data analysis workloads have more application-level instructions
  • Service workloads have higher percentages of kernel-level instructions
[Figure: share of kernel versus application instructions (0% to 100%). Workloads: Naive Bayes, Grep, K-means, PageRank, Hive-bench, HMM, Media Streaming, Web Search, SPECFP, SPECWeb, HPCC-DGEMM, HPCC-HPL, HPCC-RandomAccess. Group labels: Data analysis, Service]
Pipeline Stall
• DC workloads have severe front-end stalls (i.e. instruction fetch stalls)
• Services: more RAT (Register Allocation Table) stalls
• Data analysis: more RS (Reservation Station) full and ROB (ReOrder Buffer) full stalls
[Figure: breakdown of pipeline stalls (0% to 100%) into instruction fetch stall, RAT stall, load stall, RS full stall, store stall, and ROB full stall. Workloads: Naive Bayes, SVM, Grep, WordCount, K-means, Fuzzy K-means, PageRank, Sort, Hive-bench, IBCF, HMM, avg, Software Testing, Media Streaming, Data Serving, Web Search, Web Serving, SPECFP, SPECINT, SPECWeb, HPCC-COMM, HPCC-DGEMM, HPCC-FFT, HPCC-HPL, HPCC-PTRANS, HPCC-RandomAccess, HPCC-STREAM]
Architecture Block Diagram
Front End Stall Reasons
• For DC workloads, high instruction cache miss and instruction TLB miss rates make the front end inefficient
[Figure: L1 instruction cache misses per K-instruction (0 to 100)]
[Figure: ITLB page walks per K-instruction (0 to 0.35)]
Workloads in both figures: Naive Bayes, SVM, Grep, WordCount, K-means, Fuzzy K-means, PageRank, Sort, Hive-bench, IBCF, HMM, avg, Software Testing, Media Streaming, Data Serving, Web Search, Web Serving, SPECFP, SPECINT, SPECWeb, HPCC-COMM, HPCC-DGEMM, HPCC-FFT, HPCC-HPL, HPCC-PTRANS, HPCC-RandomAccess, HPCC-STREAM
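The figures in this use case report events "per K-instruction", that is, normalized to 1,000 retired instructions so that workloads of different lengths can be compared. A minimal sketch of that normalization follows; the counter names and readings are hypothetical and do not come from the talk.

```python
def per_kilo_instruction(event_count, instructions):
    """Return the number of events per 1,000 retired instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000.0 * event_count / instructions


# Hypothetical hardware counter readings for one run of one workload.
counters = {
    "instructions": 2_000_000_000,
    "l1i_misses": 60_000_000,
    "itlb_page_walks": 400_000,
}

l1i_mpki = per_kilo_instruction(counters["l1i_misses"], counters["instructions"])
itlb_walks_pki = per_kilo_instruction(
    counters["itlb_page_walks"], counters["instructions"]
)
print(f"L1I misses per K-instruction: {l1i_mpki:.2f}")
print(f"ITLB page walks per K-instruction: {itlb_walks_pki:.2f}")
```

The same helper applies to the L2 cache miss and DTLB page walk metrics shown on the later slides.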
MLC Behaviors
• DC workloads have more MLC (mid-level cache) misses than HPC workloads
• Data analysis workloads have better locality (fewer L2 cache misses)
[Figure: L2 cache misses per K-instruction (0 to 100). Workloads: Naive Bayes, Grep, K-means, PageRank, Hive-bench, HMM, Media Streaming, Web Search, SPECFP, SPECWeb, HPCC-DGEMM, HPCC-HPL, HPCC-RandomAccess. Group labels: Data analysis, Service, HPCC]
LLC Behaviors
• The LLC is good enough for DC workloads
• Most L2 cache misses can be satisfied by the LLC
[Figure: ratio of L2 cache misses satisfied by the L3 cache (0% to 100%). Workloads: Naive Bayes, Grep, K-means, PageRank, Hive-bench, HMM, Media Streaming, Web Search, SPECFP, SPECWeb, HPCC-DGEMM, HPCC-HPL, HPCC-RandomAccess]
DTLB Behaviors
• DC workloads have more DTLB misses than HPC workloads
• Most data analysis workloads have fewer DTLB misses
[Figure: DTLB page walks per K-instruction (0 to 2.5). Workloads: Naive Bayes, Grep, K-means, PageRank, Hive-bench, HMM, Media Streaming, Web Search, SPECFP, SPECWeb, HPCC-DGEMM, HPCC-HPL, HPCC-RandomAccess. Group labels: Data analysis, Service, HPCC]
Branch Prediction
• DC:
  • Data analysis workloads have pretty good branch behaviors
  • Branches in service workloads are hard to predict
[Figure: branch misprediction ratio (0.00% to 8.00%). Workloads: Naive Bayes, Grep, K-means, PageRank, Hive-bench, HMM, Media Streaming, Web Search, SPECFP, SPECWeb, HPCC-DGEMM, HPCC-HPL, HPCC-RandomAccess. Group labels: Data analysis, Service, HPCC]
DC Workloads Characteristics
• Data analysis applications share many inherent characteristics, which place them in a different class from desktop, HPC, traditional server, and scale-out service workloads.
• More details can be found in our IISWC 2013 paper:
  • Characterizing Data Analysis Workloads in Data Centers. Zhen Jia, et al. 2013 IEEE International Symposium on Workload Characterization (IISWC 2013)
Use Case 2: Architecture Research
• Using BigDataBench 1.0 Beta
• Data scale
  • 10 GB to 2 TB
• Hadoop configuration
  • 1 master, 14 slave nodes
Use Case 2: Architecture Research
• Some micro-architectural events tend towards stability once the data volume increases to a certain extent
• Cache and TLB behaviors show different trends with increasing data volume for different workloads
  • L1I misses per 1,000 instructions: increase for Sort, decrease for Grep
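One way to make "tends towards stability" operational is to look for the data volume after which a metric stops changing by more than a small relative tolerance. The sketch below is an assumption about how such a check could be written, not the authors' procedure; the 5% tolerance and the sample values are hypothetical.

```python
def stabilization_point(volumes_gb, metric, tolerance=0.05):
    """Return the first volume after which every step-to-step relative change
    in the metric stays within the tolerance, or None if it never settles."""
    for i in range(len(metric) - 1):
        tail = metric[i:]
        steady = all(
            abs(b - a) <= tolerance * abs(a)
            for a, b in zip(tail, tail[1:])
        )
        if steady:
            return volumes_gb[i]
    return None


# Hypothetical L1I misses per 1,000 instructions at increasing data volumes.
volumes_gb = [10, 50, 100, 500, 1000, 2000]
l1i_mpki = [12.0, 15.0, 17.0, 17.5, 17.6, 17.5]

print(stabilization_point(volumes_gb, l1i_mpki))
```

If a metric settles at a moderate data volume, that volume is a candidate input size for slow architecture simulations, which is the practical motivation raised on the following slide.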
Search Engine Service Experiments
• The same phenomenon is observed
  • Micro-architectural events tend towards stability once the index size increases to a certain extent
• Big data imposes challenges on architecture research, since large-scale simulation is time-consuming
• Index size: 2 GB ~ 8 GB; segment size: 4.4 GB ~ 17.6 GB
Use Case 3: System Evaluation
• Using BigDataBench 1.0 Beta
• Data scale
  • 10 GB to 2 TB
• Hadoop configuration
  • 1 master, 14 slave nodes
System Evaluation
• There is a threshold for each workload
  • 100 MB ~ 1 TB
  • The system is fully loaded when the data volume exceeds the threshold
• Sort is an exception
  • An inflexion point (10 GB ~ 1 TB)
  • The data processing rate decreases after this point
  • Global data access requirements
    • I/O and network bottleneck
• System performance depends on applications and data volumes
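The data processing rate and the inflexion point described for Sort can be computed from (data volume, runtime) pairs. The sketch below assumes the rate is data volume divided by wall-clock runtime and takes the inflexion point to be the volume at which that rate peaks; both definitions and all numbers are assumptions for illustration, not results from the talk.

```python
def processing_rates(volumes_gb, runtimes_s):
    """Return the data processing rate in GB/s for each run."""
    return [v / t for v, t in zip(volumes_gb, runtimes_s)]


def inflexion_point(volumes_gb, rates):
    """Return the data volume at which the processing rate peaks."""
    peak = max(range(len(rates)), key=rates.__getitem__)
    return volumes_gb[peak]


# Hypothetical Sort runs: data volume in GB and wall-clock runtime in seconds.
volumes_gb = [10, 100, 500, 1000, 2000]
runtimes_s = [100.0, 500.0, 2000.0, 5000.0, 16000.0]

rates = processing_rates(volumes_gb, runtimes_s)
for volume, rate in zip(volumes_gb, rates):
    print(f"{volume} GB: {rate:.3f} GB/s")
print("Rate peaks at", inflexion_point(volumes_gb, rates), "GB")
```

For a workload without such an inflexion point, the rate would instead level off once the data volume passes the threshold at which the system is fully loaded.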
Conclusion
• ICTBench
  • DCBench
  • BigDataBench
  • CloudRank
• An open-source project on datacenter and big data benchmarking: http://prof.ict.ac.cn/ICTBench
Publications
• BigDataBench: a Big Data Benchmark Suite from Web Search Engines. Wanling Gao, et al. The Third Workshop on Architectures and Systems for Big Data (ASBD 2013), in conjunction with ISCA 2013.
• Characterizing Data Analysis Workloads in Data Centers. Zhen Jia, et al. 2013 IEEE International Symposium on Workload Characterization (IISWC 2013).
• Characterizing OS Behavior of Scale-out Data Center Workloads. Chen Zheng, et al. Seventh Annual Workshop on the Interaction amongst Virtualization, Operating Systems and Computer Architecture (WIVOSCA 2013), in conjunction with ISCA 2013.
• Characterization of Real Workloads of Web Search Engines. Huafeng Xi, et al. 2011 IEEE International Symposium on Workload Characterization (IISWC 2011).
• The Implications of Diverse Applications and Scalable Data Sets in Benchmarking Big Data Systems. Zhen Jia, et al. Second Workshop on Big Data Benchmarking (WBDB 2012 India) & Lecture Notes in Computer Science (LNCS).
• CloudRank-D: Benchmarking and Ranking Cloud Computing Systems for Data Processing Applications. Chunjie Luo, et al. Front. Comput. Sci. (FCS) 2012, 6(4): 347–362.
Thank you! Any questions?