microbenchmarks - benchcouncil · n for a layer with d-dimensional input x = (x(1) . . . x(d))...
TRANSCRIPT
INSTITUTE O
F COM
PUTING
TECHN
OLO
GY
Micro Benchmarks
Wanling Gao
ICT,ChineseAcademyofSciences
HPCA2019, WashingtonD.C.,USA
BigDataBench HPCA2019
AscalablebigdataandAIbenchmarksuite
n Treat big data, AI and Internet service workloads as a pipeline of units of computation handling (input or intermediate) data
n Target: find the main abstractions of time-consuming units of computation (data motifs)n The combination of data motifs = complex workloads
• Similar to Relational Algebra
n Datamotifs-basedscalablebenchmarkingmethodologyWanling Gao,Jianfeng Zhan,LeiWang, et al.DataMotif: A Lens towards Fully Understanding BigDataandAIWorkloads.PACT 2018.
BigDataBench HPCA2019
BigDataBench Publicationsn DataMotifs:A Lens TowardsFullyUnderstandingBigDataand AIWorkloads.
PACT’18.n BigDataBench:aScalable and UnifiedBigDataandArtificialIntelligence
BenchmarkSuite.TechnicalReport.n UnderstandingBigDataAnalyticsWorkloadson ModernProcessors.TPDS’16n Auto-tuningSparkBigDataWorkloadsonPOWER8:Prediction-BasedDynamic
SMT.PACT’16n BigDataBench:aBigDataBenchmarkSuitefromInternetServices.HPCA’14n CVR:EfficientVectorizationofSpMV onX86Processors.CGO’18.n BOPS,NotFLOPS!ANewMetric,MeasuringTool,andRooflinePerformance
ModelForDatacenterComputing.Technicalreport.n Data Motif-based Proxy Benchmarks for Big Data and AI Workloads. IISWC
2018.
BigDataBench HPCA2019
Micro Benchmark Target
n Capture one class of unit of computation inbig data and AI
n Easily be ported to anewcomputersystemorarchitecture at an earlier stage
BigDataBench HPCA2019
Outline
n Summary of Micro Benchmark
n Micro Benchmark Characterization
n Conclusion
BigDataBench HPCA2019
Summary
n 27 micro benchmarksn Covering 6 workload types
• Offline analytics, Graph analytics• Streaming, NoSQL, Data warehouse• AI
n Covering 8 data motifs• Transform, Graph, Set, Sort, Matrix, Logic, Sampling, Basicstatistics
n Covering 5 application domains• Internet Service (Social network, Search engine, E-commerce)• Recognition Science• Medical Science
BigDataBench HPCA2019
MicroBenchmarks
AI
NoSQL
Offlineanalytics
Graphanalytics
Streaming
Datawarehouse
BigDataBench HPCA2019
Sort
n Sort the key value according to a certain order
n Data inputn Wikipedia entries
n Software stacksn Hadoop,Spark,Flink, MPI
BigDataBench HPCA2019
Grep
n Extract matchingstringsfromtextfilesandcountshowmanytimetheyoccurred
n Data inputn Wikipedia entries
n Software stacksn Hadoop,Spark,Flink, MPI
BigDataBench HPCA2019
WordCount
n Count thenumberof words inadocument
n Data inputn Wikipedia entries
n Software stacksn Hadoop,Spark,Flink, MPI
BigDataBench HPCA2019
MD5
n A widelyused hash function producinga128-bit hashvaluen Theinputmessageisbrokenupintochunksof512-bitblocks
n Data inputn Wikipedia entries
n Software stacksn Hadoop,Spark,MPI
BigDataBench HPCA2019
Connected Component
n A subgraph inwhichany two verticesare connected toeachotherby pathsn Easily computed in lineartimeusingeither breadth-firstsearch or depth-firstsearch
n Data inputn Facebooksocialnetwork
n Software stacksn Hadoop,Spark,Flink,GraphLab,MPI
BigDataBench HPCA2019
RandSample
n Selectasubsetsamples randomlyn Using a random data generator to determinewhether the data is selected or not
n Data inputn Wikipedia entries
n Software stacksn Hadoop,Spark,MPI
BigDataBench HPCA2019
FFT
n Cooley–Tukeyalgorithmn radix-2 decimation-in-time(DIT)FFT
n Data inputn Two-dimensional matrix
n Software stacksn Hadoop,Spark,MPI
BigDataBench HPCA2019
Matrix Multiply
n Compute a matrix from two matrics
n Data inputn Two-dimensional matrix
n Software stacksn Hadoop,Spark,MPI
BigDataBench HPCA2019
NoSQL ---Read, Write, Scan
n Benchmarksn Read records randomlyn Write new recordsn Scan records in order
n Data inputn ProfSearch resumes
• asemi-structureddatasetfromaverticalsearchengineforscientists
n Software stacksn Hbase, MongoDB
BigDataBench HPCA2019
OrderBy
n Order the data according to specific item
n Data inputn E-commercetransaction
n Software stacksn Hive,Spark-SQL,Impala
BigDataBench HPCA2019
Aggregation
n Gather information andaggregate inasummaryform
n Data inputn E-commercetransaction
n Software stacksn Hive,Spark-SQL,Impala
BigDataBench HPCA2019
Project
n Retrieve specified attributes(columns)
n Data inputn E-commercetransaction
n Software stacksn Hive,Spark-SQL,Impala
BigDataBench HPCA2019
Filter
n Selectpartial recordsthatmatchcertaincriteria
n Data inputn E-commercetransaction
n Software stacksn Hive,Spark-SQL,Impala
BigDataBench HPCA2019
Select
n Select a set ofrecordsfromoneormoretables
n Data inputn E-commercetransaction
n Software stacksn Hive,Spark-SQL,Impala
BigDataBench HPCA2019
Union
n CombinetheresultoftwoormoreSELECTstatements
n Data inputn E-commercetransaction
n Software stacksn Hive,Spark-SQL,Impala
BigDataBench HPCA2019
Convolution
n The general expression
n Data inputn Image dataset---Cifar, ImageNetn Convolution kernel
n Software stacksn TensorFlow, Caffe2, PyTorch, Pthread
Note:g(x,y) is the filtered image, f(x,y) is the original image, ω is the filter kernel
BigDataBench HPCA2019
Fully Connected
n Haveconnectionstoallneurons inthepreviouslayern Matrix multiplication followedbyabiasoffset
n Data inputn Image dataset---Cifar, ImageNet
n Software stacksn TensorFlow, Caffe2, PyTorch, Pthread
BigDataBench HPCA2019
Relu
n Abbreviationof rectified linear unitn Is definedasthepositivepartofitsargument
n Data inputn Image dataset---Cifar, ImageNet
n Software stacksn TensorFlow, Caffe2, PyTorch, Pthread
x is the input to a neuron
BigDataBench HPCA2019
Sigmoid
n Sigmoid activation function
n Data inputn Image dataset---Cifar, ImageNet
n Software stacksn TensorFlow, Caffe2, PyTorch, Pthread
BigDataBench HPCA2019
Tanh
n Tanh activation function
n Data inputn Image dataset---Cifar, ImageNet
n Software stacksn TensorFlow, Caffe2, PyTorch, Pthread
BigDataBench HPCA2019
MaxPooling
n Non-linear down-samplingn Dividing theinputimageintoasetofnon-overlappingrectangles
n Outputsthemaximum foreachsub-rectangle
n Data inputn Image dataset---Cifar, ImageNet
n Software stacksn TensorFlow, Caffe2, PyTorch, Pthread
BigDataBench HPCA2019
AvgPooling
n Non-linear down-samplingn Dividing theinputimageintoasetofnon-overlappingrectangles
n Outputstheaverage value foreachsub-rectangle
n Data inputn Image dataset---Cifar, ImageNet
n Software stacksn TensorFlow, Caffe2, PyTorch, Pthread
BigDataBench HPCA2019
Batch Normalization
n A normalizationmethod/layerforneuralnetworksn Foralayerwithd-dimensionalinputx=(x(1)...x(d))
n Data inputn Image dataset---Cifar, ImageNet
n Software stacksn TensorFlow, Caffe2, PyTorch, Pthread
Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
BigDataBench HPCA2019
Cosine Normalization
n UsingCosineSimilarityinNeuralNetworksn InsteadofDotProduct
n Data inputn Image dataset---Cifar, ImageNet
n Software stacksn TensorFlow, Caffe2, PyTorch, Pthread
where netnorm is the normalized pre-activation, w⃗ is the incoming weight vector and ⃗x is the input vector, (·) indicates dot product, f is nonlinear activation function
Luo C, Zhan J, Xue X, Wang L, Ren R, Yang Q. Cosine normalization: Using cosine similarity instead of dot product in neural networks. InInternational Conference on Artificial Neural Networks 2018 Oct 4 (pp. 382-391).
BigDataBench HPCA2019
Dropout
n A regularization techniqueforreducingoverfitting in neural networks
n Data inputn Image dataset---Cifar, ImageNet
n Software stacksn TensorFlow, Caffe2, PyTorch, Pthread
BigDataBench HPCA2019
Software Stacks
BigDataBench HPCA2019
Outline
n Summary of Micro Benchmark
n Micro Benchmark Characterization
n Conclusion
BigDataBench HPCA2019
Experiment Setups
n Three-node cluster
BigDataBench HPCA2019
Data Configuration
n To fully utilize the memory resourcesn Big data micro benchmarks
• 100 GB text data• 2^26-vertex graph data• 65536two-dimensionmatrixdata
n AI micro benchmarks• Input dimension 224*224, channels 64• 100K images from ImageNet
BigDataBench HPCA2019
System Behaviorsn CPU Utilization &I/OWait
n Hadoop have higer CPU utilization and less I/O wait than sparkn AI micro benchmarks have lower I/O wait than big datan Some of AI micro benchmarks are cpu intensiven Pthread benchmarks havelessCPUutilizationandI/OWaitingeneral
BigDataBench HPCA2019
I/O Behaviors
n Disk I/O Bandwidth &Network I/O Bandwidthn SparkstackhasmuchlargernetworkI/OpressurethanthatofHadoop
stack• Moredatashuffles,soitneedstransferringdatafromonenodetoanotherone
frequently
BigDataBench HPCA2019
Execution Performance
n The overall running efficiency of theworkloadsn Instruction level parallelism (ILP)
• Retired instructions per cycle (IPC)
n Memory level parallelism (MLP)• Dividing L1D_PEND_MISS.PENDINGbyL1D_PEND_MISS.PENDING_CYCLES
BigDataBench HPCA2019
Execution Performance
n ILP & MLP
n CoverawiderangeofILPandMLP behaviors
• Distinct computation andmemory access patterns
n softwarestackchangescomputationandmemoryaccesspatterns
• Hadoop FFT v.s. Spark FFT
BigDataBench HPCA2019
Top-DownMethod
n Issuepointasthedividingpoint
From“ATop-DownMethodforPerformanceAnalysisandCountersArchitecture”
Whetherthemicrooperationisretired?
Notreadywithmoreuops
Onlyretiringis“usefulwork”
BigDataBench HPCA2019
Pipeline Efficiency
n Top-Down Methodologyn Retiring, Frontend bound, Backend bound, Bad speculation
• Hadoop: notablestallsduetofrontendboundandbadspeculation• Spark: Higher backend bound• AI reflects different bottlenecks
BigDataBench HPCA2019
Frontend Bound
n Frontend latency bound > Frontend bandwidth boundn Latency bound: notablestallsduetofrontendboundandbad
speculationn Bandwidth bound: deliveringinsufficientuops comparingtothe
theoreticalvalue
BigDataBench HPCA2019
Data Motif – Frontend Bound
n Frontend Bound Breakdownn Top 3:branchresteers, instructioncachemiss, MSswitch
• The first reason is the delaystoobtainthecorrectinstructions• MS switch: bigdataandAIsystemsusemanyCISCinstructionsthatcannotbe
decodedbydefaultdecoder
BigDataBench HPCA2019
Data Motif – Backend Bound
n Memory bound (datamovementdelays) > Core boundn Memory bound:L1, L2, L3, external memory boundn Core bound: thelackofhardwareresources or portunder-utilization
BigDataBench HPCA2019
Overview
n Lookingbackathistory
nWhat is DataMotif
nCharacterization of Data Motif
n Impact of Data Input
nConclusion
BigDataBench HPCA2019
Impact of Data Input
Size Pattern Type &Source
BigDataBench HPCA2019
Similarity Analysisn Three data configurations
n Small, Medium, Large
n Sixtymetricsspanningsystemandmicro-architecture
n MeasuringSimilarityn PCAn Hierarchicalclustering
BigDataBench HPCA2019
BigDataBench HPCA2019
Size Impact on I/O Behaviors
n I/O Bandwidthn UsingtheI/ObandwidthofSmalldatasizeasbaseline,wenormalize
theI/ObandwidthofMediumandLargedatasize
BigDataBench HPCA2019
Size Impact on Pipeline Behavior
n Datasizeincreasesà frontendbounddecrease, backendboundincrease
BigDataBench HPCA2019
Impact of Data Input
Size Pattern Type &Source
BigDataBench HPCA2019
Impact of Data Pattern
n Dense matrix V.S. Sparse matrixn I/O Bandwidth: Sparse < Densen Frontend Stalls: Sparse > Dense
BigDataBench HPCA2019
Impact of Data Input
Size Pattern Type &Source
BigDataBench HPCA2019
Impact of Data Type and Source
n Un-structured text data & Semi-structuredsequencedatan System:1.12-7.29 differencesn Architecture: text format incurs more backend bound
BigDataBench HPCA2019
Outline
n Summary of Micro Benchmark
n Micro Benchmark Characterization
n Conclusion
BigDataBench HPCA2019
Conclusion
n Website:n http://www.benchcouncil.org/benchmarks.htmln http://www.benchcouncil/BigDataBenchn http://prof.ict.ac.cn/BigDataBench
n Micro benchmarkn Single data motif implementation
BigDataBench HPCA2019