
Page 1: Benchmarking Datacenter and Big Data Systems

INSTITUTE OF COMPUTING TECHNOLOGY

Benchmarking Datacenter and Big Data Systems

Wanling Gao, Zhen Jia, Lei Wang, Yuqing Zhu, Chunjie Luo, Yingjie Shi, Yongqiang He, Shiming Gong, Xiaona Li, Shujie Zhang, Bizhu Qiu, Lixin Zhang, Jianfeng Zhan

http://prof.ict.ac.cn/jfzhan

Page 2: Benchmarking Datacenter and Big Data Systems

Acknowledgements

This work is supported by the Chinese 973 project (Grant No. 2011CB302502), the Hi-Tech Research and Development (863) Program of China (Grant No. 2011AA01A203, No. 2013AA01A213), the NSFC project (Grant No. 60933003, No. 61202075), the BNSF project (Grant No. 4133081), and Huawei funding.

Page 3: Benchmarking Datacenter and Big Data Systems

Executive summary

An open-source project on datacenter and big data benchmarking: ICTBench, http://prof.ict.ac.cn/ICTBench

Several case studies using ICTBench

Page 4: Benchmarking Datacenter and Big Data Systems

Question One

The gap between industry and academia keeps widening
• Code
• Data sets

Page 5: Benchmarking Datacenter and Big Data Systems

Question Two

Different benchmark requirements

Architecture communities
• Simulation is very slow
• Small data and code sets

System communities
• Large-scale deployment is valuable

Users need real-world applications
• There are three kinds of lies: lies, damn lies, and benchmarks

Page 6: Benchmarking Datacenter and Big Data Systems

State-of-Practice Benchmark Suites

SPEC CPU, SPEC Web, HPCC, PARSEC
TPC-C, YCSB, GridMix

Page 7: Benchmarking Datacenter and Big Data Systems

Why a New Benchmark Suite for Datacenter Computing

No existing benchmark suite covers the diversity of datacenter workloads

State of the art: CloudSuite
• Only includes six applications, selected according to their popularity

Page 8: Benchmarking Datacenter and Big Data Systems

Why a New Benchmark Suite (Cont'd)

Memory Level Parallelism (MLP): the number of simultaneously outstanding cache misses

[Figure: MLP of CloudSuite vs. our benchmark suite (DCBench)]

Page 9: Benchmarking Datacenter and Big Data Systems

Why a New Benchmark Suite (Cont'd)

Scale-out performance

[Figure: speedup vs. number of working nodes (1, 4, 8) for sort, grep, wordcount, svm, kmeans, fkmeans, all-pairs, Bayes, and HMM, comparing the CloudSuite data analysis benchmark with DCBench]

Page 10: Benchmarking Datacenter and Big Data Systems


Outline

• Background and Motivation

• Our ICTBench

• Case studies


Page 11: Benchmarking Datacenter and Big Data Systems

ICTBench Project

ICTBench: three benchmark suites
• DCBench: architecture research (application, OS, and VM execution)
• BigDataBench: system research (large-scale big data applications)
• CloudRank: cloud benchmarks (distributed management); not covered in this talk

Project homepage: http://prof.ict.ac.cn/ICTBench
The source code is available.

Page 12: Benchmarking Datacenter and Big Data Systems

DCBench

DCBench: typical datacenter workloads
• Different from scientific computing (FLOPS-oriented)
• Covers applications in important domains, e.g. search engine, electronic commerce
• Each benchmark = a single application

Purposes
• Architecture and system (small-to-medium scale) research

Page 13: Benchmarking Datacenter and Big Data Systems

BigDataBench

Characterizing big data applications
• Does not include data-intensive supercomputing
• Synthetic data sets varying from 10 GB to PB scale
• Each benchmark = a single big data application

Purposes
• Large-scale system and architecture research

Page 14: Benchmarking Datacenter and Big Data Systems

CloudRank

Cloud computing
• Elastic resource management
• Consolidating different workloads

Cloud benchmarks
• Each benchmark = a group of consolidated datacenter workloads: services / data processing / desktop

Purposes
• Capacity planning, system evaluation, and research
• Users can customize their benchmarks

Page 15: Benchmarking Datacenter and Big Data Systems

Benchmarking Methodology

Decide and rank the main application domains according to a publicly available metric, e.g. page views and daily visitors

Single out the main applications from the main application domains

Page 16: Benchmarking Datacenter and Big Data Systems

Top Sites on the Web

[Figure: share of the top sites on the Web by application domain (Search Engine 40%, Social Network 25%, Electronic Commerce 15%, Media Streaming 5%, Others 15%)]

More details at http://www.alexa.com/topsites/global;0

Page 17: Benchmarking Datacenter and Big Data Systems

Benchmarking Methodology

Decide and rank the main application domains according to a publicly available metric, e.g. page views and daily visitors

Single out the main applications from the main application domains

Page 18: Benchmarking Datacenter and Big Data Systems

Main Algorithms in Search Engines

[Figure: top sites on the Web by domain (Search Engine 40%, Social Network 25%, Electronic Commerce 15%, Media Streaming 5%, Others 15%)]

Algorithms used in search engines: PageRank, graph mining, segmentation, feature reduction, grep, statistical counting, vector calculation, sort, recommendation, ...

Page 19: Benchmarking Datacenter and Big Data Systems

Main Algorithms in Search Engines (Nutch)

[Figure: algorithms in the Nutch workflow: word grep, word count, segmentation, sort, classification, decision tree, BFS, scoring & sort, merge sort, vector calculation, PageRank]

Page 20: Benchmarking Datacenter and Big Data Systems

Main Algorithms in Social Networks

[Figure: top sites on the Web by domain (Search Engine 40%, Social Network 25%, Electronic Commerce 15%, Media Streaming 5%, Others 15%)]

Algorithms used in social networks: recommendation, clustering, classification, graph mining, grep, feature reduction, statistical counting, vector calculation, sort, ...

Page 21: Benchmarking Datacenter and Big Data Systems

Main Algorithms in Electronic Commerce

[Figure: top sites on the Web by domain (Search Engine 40%, Social Network 25%, Electronic Commerce 15%, Media Streaming 5%, Others 15%)]

Algorithms used in electronic commerce: recommendation, association rule mining, warehouse operation, clustering, classification, statistical counting, vector calculation, ...

Page 22: Benchmarking Datacenter and Big Data Systems

Overview of DCBench

Category                | Workload                           | Programming model | Language   | Source
Basic operation         | Sort                               | MapReduce         | Java       | Hadoop
Basic operation         | Wordcount                          | MapReduce         | Java       | Hadoop
Basic operation         | Grep                               | MapReduce         | Java       | Hadoop
Classification          | Naïve Bayes                        | MapReduce         | Java       | Mahout
Classification          | Support Vector Machine             | MapReduce         | Java       | Implemented by ourselves
Cluster                 | K-means                            | MapReduce / MPI   | Java / C++ | Mahout / IBM PML
Cluster                 | Fuzzy k-means                      | MapReduce / MPI   | Java / C++ | Mahout / IBM PML
Recommendation          | Item-based Collaborative Filtering | MapReduce         | Java       | Mahout
Association rule mining | Frequent pattern growth            | MapReduce         | Java       | Mahout
Segmentation            | Hidden Markov model                | MapReduce         | Java       | Implemented by ourselves

Page 23: Benchmarking Datacenter and Big Data Systems

Overview of DCBench (Cont'd)

Category            | Workload                            | Programming model | Language | Source
Warehouse operation | Database operations                 | MapReduce         | Java     | Hive-bench
Feature reduction   | Principal Component Analysis        | MPI               | C++      | IBM PML
Feature reduction   | Kernel Principal Component Analysis | MPI               | C++      | IBM PML
Vector calculation  | Paper similarity analysis           | All-Pairs         | C & C++  | Implemented by ourselves
Graph mining        | Breadth-first search                | MPI               | C++      | Graph500
Graph mining        | PageRank                            | MapReduce         | Java     | Mahout
Service             | Search engine                       | C/S               | Java     | Nutch
Service             | Auction                             | C/S               | Java     | RUBiS
Service             | Media streaming                     | C/S               | Java     | CloudSuite

Page 24: Benchmarking Datacenter and Big Data Systems

Methodology of Generating Big Data

To preserve the characteristics of real-world data:

Small-scale data -> characteristic analysis -> expand -> big data

Characteristics considered
• Semantics, e.g. word frequency
• Word reuse distance
• Word distribution in documents
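To make the expansion step concrete, the following is a minimal, hypothetical sketch of a frequency-preserving text expander: it measures the word-frequency distribution of a small seed corpus and then samples synthetic text from that distribution. It only covers the word-frequency characteristic (a real generator would also need to preserve reuse distance and the word distribution across documents, as noted above), and the class name and command-line interface are illustrative, not the BigDataBench tool itself.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Illustrative sketch (assumed names, not the BigDataBench generator):
 *  expand a small seed corpus into a larger synthetic corpus while keeping
 *  the seed's word-frequency distribution. */
public class FrequencyPreservingExpander {

    public static void main(String[] args) throws IOException {
        // Usage: <seedFile> <targetWordCount> <outputFile>

        // 1. Characteristic analysis: count word frequencies in the seed text.
        List<String> words = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            for (String w : line.toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) words.add(w);
            }
        }
        if (words.isEmpty()) {
            System.err.println("Seed corpus is empty");
            return;
        }
        Map<String, Long> freq = new HashMap<>();
        for (String w : words) freq.merge(w, 1L, Long::sum);

        // 2. Build a cumulative distribution for weighted sampling.
        String[] vocab = freq.keySet().toArray(new String[0]);
        long[] cumulative = new long[vocab.length];
        long total = 0;
        for (int i = 0; i < vocab.length; i++) {
            total += freq.get(vocab[i]);
            cumulative[i] = total;
        }

        // 3. Expand: sample words according to the measured distribution
        //    until the requested number of synthetic words is produced.
        long targetWords = Long.parseLong(args[1]);
        Random rng = new Random(42);
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get(args[2]))) {
            for (long n = 0; n < targetWords; n++) {
                long r = (long) (rng.nextDouble() * total);
                out.write(vocab[firstGreater(cumulative, r)]);
                out.write(n % 20 == 19 ? "\n" : " ");
            }
        }
    }

    /** Index of the first cumulative count strictly greater than r. */
    private static int firstGreater(long[] cum, long r) {
        int lo = 0, hi = cum.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cum[mid] > r) hi = mid; else lo = mid + 1;
        }
        return lo;
    }
}
```

Usage would be along the lines of `java FrequencyPreservingExpander seed.txt 1000000000 synthetic.txt` to emit roughly one billion words drawn from the seed corpus's distribution.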

Page 25: Benchmarking Datacenter and Big Data Systems

Workloads in BigDataBench 1.0 Beta

Analysis workloads
• Simple but representative operations: Sort, Grep, Wordcount
• Highly recognized algorithms: Naïve Bayes, SVM

Search engine service workloads
• Widely deployed services: Nutch server
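For reference, the Wordcount analysis workload corresponds to the canonical Hadoop MapReduce word count. The sketch below follows the standard Hadoop example (org.apache.hadoop.mapreduce API); it illustrates the shape of such a workload and is not necessarily the exact code shipped with BigDataBench.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  /** Map: emit (word, 1) for every token in the input split. */
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  /** Reduce: sum the counts for each word. */
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Sort and Grep in the suite are likewise Hadoop MapReduce jobs built on the same Mapper/Reducer structure.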

Page 26: Benchmarking Datacenter and Big Data Systems

A Variety of Workloads Is Included

Off-line workloads
• Base operations: Sort (I/O bound), Wordcount (CPU bound), Grep (hybrid)
• Machine learning: Naïve Bayes, SVM

On-line workloads
• Nutch server

Page 27: Benchmarking Datacenter and Big Data Systems

Features of Workloads

Workload     | Resource characteristic | Computing complexity                                  | Instructions
Sort         | I/O bound               | O(n*lg n)                                             | Integer comparison domination
Wordcount    | CPU bound               | O(n)                                                  | Integer comparison and calculation domination
Grep         | Hybrid                  | O(n)                                                  | Integer comparison domination
Naïve Bayes  | /                       | O(m*n) [m: the length of the dictionary]              | Floating-point computation domination
SVM          | /                       | O(M*n) [M: the number of support vectors * dimension] | Floating-point computation domination
Nutch server | I/O & CPU bound         | /                                                     | Integer comparison domination

Page 28: Benchmarking Datacenter and Big Data Systems


Content

• Background and Motivation

• Our ICTBench

• Case studies


Page 29: Benchmarking Datacenter and Big Data Systems

Use Case 1: Microarchitecture Characterization

Using DCBench

Five-node cluster
• One master and four slaves (working nodes)
• Each node:

Page 30: Benchmarking Datacenter and Big Data Systems

Instruction Execution Level

DCBench:
• Data analysis workloads have more application-level instructions
• Service workloads have higher percentages of kernel-level instructions

[Figure: kernel vs. application instruction breakdown for data analysis workloads (Naïve Bayes, Grep, K-means, PageRank, Hive-bench, HMM), service workloads (Media Streaming, Web Search), SPECFP, SPECWeb, and HPCC (DGEMM, HPL, RandomAccess)]

Page 31: Benchmarking Datacenter and Big Data Systems

Pipeline Stalls

DC workloads suffer severe front-end stalls (i.e. instruction fetch stalls)
• Services: more RAT (Register Allocation Table) stalls
• Data analysis: more RS (Reservation Station) full and ROB (ReOrder Buffer) full stalls

[Figure: stall breakdown (instruction fetch stall, RAT stall, load stall, RS full stall, store stall, ROB full stall) for DCBench workloads (Naïve Bayes, SVM, Grep, WordCount, K-means, Fuzzy K-means, PageRank, Sort, Hive-bench, IBCF, HMM, average), CloudSuite workloads (Software Testing, Media Streaming, Data Serving, Web Search, Web Serving), SPEC (FP, INT, Web), and HPCC (COMM, DGEMM, FFT, HPL, PTRANS, RandomAccess, STREAM)]

Page 32: Benchmarking Datacenter and Big Data Systems


Architecture Block Diagram


Page 33: Benchmarking Datacenter and Big Data Systems

Front-End Stall Reasons

For DC workloads, high instruction cache miss and instruction TLB miss rates make the front end inefficient

[Figure: L1 instruction cache misses per K-instructions and ITLB page walks per K-instructions for DCBench workloads (Naïve Bayes, SVM, Grep, WordCount, K-means, Fuzzy K-means, PageRank, Sort, Hive-bench, IBCF, HMM, average), CloudSuite workloads (Software Testing, Media Streaming, Data Serving, Web Search, Web Serving), SPEC (FP, INT, Web), and HPCC (COMM, DGEMM, FFT, HPL, PTRANS, RandomAccess, STREAM)]

Page 34: Benchmarking Datacenter and Big Data Systems

MLC Behaviors

DC workloads have more MLC (L2 cache) misses than HPC workloads
• Data analysis workloads have better locality (fewer L2 cache misses)

[Figure: L2 cache misses per K-instructions for data analysis workloads (Naïve Bayes, Grep, K-means, PageRank, Hive-bench, HMM), service workloads (Media Streaming, Web Search), SPECFP, SPECWeb, and HPCC (DGEMM, HPL, RandomAccess)]

Page 35: Benchmarking Datacenter and Big Data Systems

LLC Behaviors

The LLC is good enough for DC workloads
• Most L2 cache misses can be satisfied by the LLC

[Figure: ratio of L2 cache misses satisfied by the L3 cache for data analysis workloads (Naïve Bayes, Grep, K-means, PageRank, Hive-bench, HMM), service workloads (Media Streaming, Web Search), SPECFP, SPECWeb, and HPCC (DGEMM, HPL, RandomAccess)]

Page 36: Benchmarking Datacenter and Big Data Systems

DTLB Behaviors

DC workloads have more DTLB misses than HPC workloads
• Most data analysis workloads have relatively few DTLB misses

[Figure: DTLB page walks per K-instructions for data analysis workloads (Naïve Bayes, Grep, K-means, PageRank, Hive-bench, HMM), service workloads (Media Streaming, Web Search), SPECFP, SPECWeb, and HPCC (DGEMM, HPL, RandomAccess)]

Page 37: Benchmarking Datacenter and Big Data Systems

Branch Prediction

DC workloads:
• Data analysis workloads have fairly good branch behavior
• Service workloads' branches are hard to predict

[Figure: branch misprediction ratio for data analysis workloads (Naïve Bayes, Grep, K-means, PageRank, Hive-bench, HMM), service workloads (Media Streaming, Web Search), SPECFP, SPECWeb, and HPCC (DGEMM, HPL, RandomAccess)]

Page 38: Benchmarking Datacenter and Big Data Systems

DC Workload Characteristics

Data analysis applications share many inherent characteristics, which place them in a different class from desktop, HPC, traditional server, and scale-out service workloads.

More details can be found in our IISWC 2013 paper: Characterizing Data Analysis Workloads in Data Centers. Zhen Jia, et al. 2013 IEEE International Symposium on Workload Characterization (IISWC 2013).

Page 39: Benchmarking Datacenter and Big Data Systems

Use Case 2: Architecture Research

Using BigDataBench 1.0 Beta

Data scale
• 10 GB to 2 TB

Hadoop configuration
• 1 master, 14 slave nodes

Page 40: Benchmarking Datacenter and Big Data Systems

Use Case 2: Architecture Research

Some microarchitectural events tend towards stability once the data volume increases beyond a certain point

Cache and TLB behaviors show different trends with increasing data volume for different workloads
• L1I misses per 1000 instructions: increase for Sort, decrease for Grep

Page 41: Benchmarking Datacenter and Big Data Systems

Search Engine Service Experiments

The same phenomenon is observed
• Microarchitectural events tend towards stability once the index size increases beyond a certain point

Big data imposes challenges on architecture research, since large-scale simulation is time-consuming

Index size: 2 GB to 8 GB
Segment size: 4.4 GB to 17.6 GB

Page 42: Benchmarking Datacenter and Big Data Systems

Use Case 3: System Evaluation

Using BigDataBench 1.0 Beta

Data scale
• 10 GB to 2 TB

Hadoop configuration
• 1 master, 14 slave nodes

Page 43: Benchmarking Datacenter and Big Data Systems

System Evaluation

There is a threshold for each workload (100 MB to 1 TB)
• The system is fully loaded when the data volume exceeds the threshold

Sort is an exception
• There is an inflexion point (10 GB to 1 TB); the data processing rate decreases after this point
• Global data access requirements lead to an I/O and network bottleneck

System performance depends on both the applications and the data volumes.
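As a small illustration of the metric behind these observations, the data processing rate can be taken as data volume divided by execution time. The sketch below (a hypothetical helper, not part of BigDataBench) computes this rate for a series of runs and reports the first data volume at which the rate stops improving, i.e. where the system can be considered fully loaded.

```java
/** Hypothetical helper (not part of BigDataBench): compute data processing
 *  rate = data volume / execution time for a series of runs and report the
 *  first data volume after which the rate no longer improves. */
public class ProcessingRate {

    public static void main(String[] args) {
        // Arguments are pairs: <dataVolumeGB> <runtimeSeconds> ...
        double bestRate = 0.0;
        Double threshold = null;
        for (int i = 0; i + 1 < args.length; i += 2) {
            double gb = Double.parseDouble(args[i]);
            double seconds = Double.parseDouble(args[i + 1]);
            double rate = gb / seconds;                    // GB per second
            System.out.printf("%8.1f GB  ->  %.3f GB/s%n", gb, rate);
            if (rate > bestRate) {
                bestRate = rate;                           // still scaling up
            } else if (threshold == null) {
                threshold = gb;                            // rate no longer improves
            }
        }
        if (threshold != null) {
            System.out.println("Data processing rate stops improving at about "
                    + threshold + " GB; the system is fully loaded beyond this point.");
        }
    }
}
```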

Page 44: Benchmarking Datacenter and Big Data Systems

Conclusion

ICTBench: DCBench, BigDataBench, CloudRank

An open-source project on datacenter and big data benchmarking: http://prof.ict.ac.cn/ICTBench

Page 45: Benchmarking Datacenter and Big Data Systems

Publications

BigDataBench: a Big Data Benchmark Suite from Web Search Engines. Wanling Gao, et al. The Third Workshop on Architectures and Systems for Big Data (ASBD 2013), in conjunction with ISCA 2013.

Characterizing Data Analysis Workloads in Data Centers. Zhen Jia, et al. 2013 IEEE International Symposium on Workload Characterization (IISWC 2013).

Characterizing OS Behavior of Scale-out Data Center Workloads. Chen Zheng, et al. Seventh Annual Workshop on the Interaction amongst Virtualization, Operating Systems and Computer Architecture (WIVOSCA 2013), in conjunction with ISCA 2013.

Characterization of Real Workloads of Web Search Engines. Huafeng Xi, et al. 2011 IEEE International Symposium on Workload Characterization (IISWC 2011).

The Implications of Diverse Applications and Scalable Data Sets in Benchmarking Big Data Systems. Zhen Jia, et al. Second Workshop on Big Data Benchmarking (WBDB 2012 India) & Lecture Notes in Computer Science (LNCS).

CloudRank-D: Benchmarking and Ranking Cloud Computing Systems for Data Processing Applications. Chunjie Luo, et al. Frontiers of Computer Science (FCS) 2012, 6(4): 347-362.

Page 46: Benchmarking Datacenter and Big Data Systems


Thank you! Any questions?
