This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 780966.

Evidence Based Big Data Benchmarking to Improve Business Performance

D3.1 DataBench Architecture

Abstract

This document provides an overview of the DataBench Toolbox architecture and its main functional elements. The DataBench Toolbox aims to be an umbrella framework for big data benchmarking based on existing efforts in the community. It will include features to reuse existing big data benchmarks in a common framework, and will help users to search, select, download, execute and obtain a set of technical and business indicators out of the benchmarks' results. The Toolbox is therefore one of the main building blocks of the project and the main interaction point with the users of benchmarking tools. This document provides the architectural foundations and main elements of the tooling support to be used by big data benchmarking practitioners. In this sense, the document gives an overview of the different elements of the DataBench ecosystem to contextualize the significance of the Toolbox, as well as details about the different components of the Toolbox identified so far, and hints about their potential implementation. This document is the first deliverable related to the DataBench Toolbox. Updates to the architecture will be provided as an integral part of the different releases of the Toolbox expected in the DataBench WP3 lifecycle.



Keywords: Benchmarking, big data, big data technologies, architecture, business performance, performance metrics, toolbox, use cases

Disclaimer: This document reflects the authors' view only. The European Commission is not responsible for any use that may be made of the information this document contains.

Copyright Notice: Copyright belongs to the authors of this document. Use of any materials from this document should be referenced and is at the user's own risk.

Deliverable: D3.1 DataBench Architecture

Work package: WP3

Task: 3.1

Due date: 30/06/2018

Submission date: 07/07/2018

Deliverable lead: ATOS

Version: 1.0

Authors: ATOS (Tomás Pariente, Iván Martínez and Ricardo Ruiz), GUF (Todor Ivanov), SINTEF (Arne Berre), POLIMI (Chiara Francalanci)

Reviewers: Richard Stevens (IDC)


Table of Contents

Executive Summary

1. Introduction

2. DataBench in a nutshell

2.1 DataBench elements

2.2 The DataBench Toolbox

2.3 Benchmarks integration

2.4 Business KPI derivation process

2.5 The Machine Learning framework

3. DataBench processes

3.1 DataBench actors

3.2 DataBench UML Use Cases

3.2.1 DataBench Accessing use case

3.2.2 DataBench KPIs management use case

3.2.3 DataBench User intentions and Setup/Runtime use cases

3.2.4 DataBench Visualization and Reporting use case

3.2.5 DataBench Benchmark Management use case

3.3 DataBench components

3.3.1 Accessing Subsystem

3.3.2 User Intentions Subsystem

3.3.3 Setup and Runtime Subsystem

3.3.4 Analytics and KPI Management Subsystem

3.3.5 Benchmark Management Subsystem

3.3.6 Visualization and Reporting Subsystem

4. DataBench Toolbox architecture

4.1.1 Web interface

4.1.2 Benchmarking framework

4.1.3 Results interface

4.1.4 Results parser

4.1.5 Metrics Spawning

4.1.6 Results DB

4.2 Relation to the BDVA Reference Model

4.3 Vision and potential implementation

4.3.1 Phase 1: Alpha version

4.3.2 Phase 2: Final version

5. Conclusions

6. References


Table of Figures

Figure 1. Functional view of the DataBench ecosystem

Figure 2. Matrix with Big Data Benchmarks

Figure 3. Detailed Classification of the Associated Benchmarks

Figure 4. Preliminary set of blueprints tying business KPIs with technical benchmarks

Figure 5. Accessing use case diagram

Figure 6. Technical and business KPIs management use case diagram

Figure 7. User intentions and Setup/Runtime use cases

Figure 8. Visualization and Reporting use case diagram

Figure 9. DataBench Benchmark management use case diagram

Figure 10. DataBench components diagram

Figure 11. Accessing subsystem

Figure 12. User Intentions subsystem

Figure 13. Setup and runtime subsystem

Figure 14. KPIs management subsystem

Figure 15. Benchmark management subsystem

Figure 16. Visualization and Reporting subsystem

Figure 17. Functional overview of the framework architecture

Figure 18. Big Data Value Reference Model with the Associated Benchmarks

Figure 19. DataBench Toolbox potential implementation

Figure 20. Benchmark orchestrator detail

Figure 21. Results processing


Executive Summary

This document presents an overview of the DataBench ecosystem with special emphasis on the initial specification of the DataBench Toolbox architecture. To this end, the document puts the DataBench Toolbox in the context of the main functional elements to be delivered by the different Work Packages of the project. In order to do that, the document gives an overview of the different elements of the ecosystem, namely the identification of potential benchmarks to be integrated into the Toolbox and their alignment to the BDVA Reference Model, the business reports about the elements of industrial significance for benchmarking big data solutions, the identification of business KPIs and the project Handbook, along with the supporting Artificial Intelligence methods to be delivered to recommend and generalise aspects of the selection of benchmarks or the derivation of KPIs.

Besides the description of the context, the core of the document is the definition of the architecture of the DataBench Toolbox. The DataBench Toolbox aims to be an umbrella framework for big data benchmarking based on existing efforts in the community. It will include features to reuse existing big data benchmarks in a common framework, and will help users to search, select, download, execute and obtain a set of homogenized results. The DataBench Toolbox will be an integral part of the DataBench Framework, which will ultimately deliver recommendations and business insights out of big data benchmarks. Therefore, a potential architecture is depicted and explained, along with Unified Modelling Language (UML) use cases representing the processes associated with the architecture. Ideas, phases and potential technologies to implement the Toolbox are also explained.


1. Introduction

This document presents the DataBench Toolbox initial architecture. The DataBench Toolbox will include formal mechanisms to ease the process of reusing existing or new big data benchmarks in a common framework that will help stakeholders to find, select, download, execute and obtain a set of homogenized metrics. The DataBench Toolbox will be an integral part of the DataBench framework, which will ultimately deliver recommendations and business KPIs out of big data benchmarks.

The present document therefore starts by putting the DataBench Toolbox in context with the rest of the DataBench framework, and later dives into the details of the envisaged architecture. It is important to note that the Toolbox will be based on existing efforts in big data benchmarking, rather than proposing new benchmarks. The DataBench Toolbox therefore aims to be an umbrella framework for big data benchmarking. The idea behind it is to provide ways to declare new benchmarks in the Toolbox and to provide a set of automatisms and recommendations that allow these tools to become part of the ecosystem. Due to the different nature and technical scope of the existing tools, the degree of automation may vary from one tool to another. The baseline will be the possibility of downloading the selected benchmarking tools from the Toolbox web user interface, with adapters to get the results of the benchmarking back into the DataBench Toolbox in order to obtain homogenized metrics. Other tools may be subject to tighter integration and even automation of their deployment in the benchmarking system.

The document is structured as follows:

• Section 1 provides the introduction to the objectives of the deliverable.
• Section 2 describes the DataBench Toolbox in the context of the rest of the DataBench ecosystem, providing a functional view of the different components.
• Section 3 dives into the processes envisaged in the usage of the DataBench ecosystem, with special detail on the ones related to the DataBench Toolbox.
• Section 4 is the core section of the document, where the first ideas about the architecture of the DataBench Toolbox are explained.
• Finally, Section 5 provides the conclusions of the document and outlines the future work on the DataBench Toolbox.


2. DataBench in a nutshell

2.1 DataBench elements

The DataBench project is composed of several Work Packages, each of which is in charge of delivering a set of results of a different nature (software, reports, handbooks, etc.). From the technical perspective, the DataBench Toolbox, to be delivered in WP3, is the main software element that will be provided to the users, but it does not stand in isolation, as depicted in Figure 1.

Figure 1. Functional view of the DataBench ecosystem

The different elements of the DataBench ecosystem listed in Figure 1 can be divided into a set of major elements of the Framework:

• The DataBench Toolbox: Developed in the scope of WP3 and the main subject of this document, the DataBench Toolbox is the core technical component of the DataBench Framework. It will be the entry point for users that would like to perform big data benchmarking and will ultimately deliver recommendations and business insights. It will include features to reuse existing big data benchmarks, and will help users to search, select, download, execute and obtain a set of homogenized results. The Toolbox takes as input most of the work done in the rest of the project's Work Packages.

• The benchmarks to be integrated into the Toolbox: Located on the left-hand side of Figure 1, external big data benchmarks are the input to the whole DataBench Framework, and in particular to the DataBench Toolbox. These benchmarks will be made available to the users from the Toolbox user interface, along with the possibility to get the results of the benchmarking execution runs and homogenize them into a common model in order to obtain a homogenized set of metrics. The degree of integration of the Toolbox with the existing benchmarks will vary depending on the integration issues that may arise.

• The process to derive business KPIs: Deriving business KPIs from the results of running big data benchmarks is not a straightforward process. It will depend on the business context and, therefore, it is not completely clear to what extent it could be automated. However, different approaches to derive business insights are currently under study and will be reported in future deliverables. This process is a collaborative work among the different Work Packages. WP1 is investigating the different existing benchmarks in order to devise a coherent framework, providing input on available technical KPIs and mappings to existing big data reference models and to the data value chain in order to cluster them. This work will be applied in WP4 to the desk analysis and to the in-depth case studies, which will result in an analysis of the relations between technical and business KPIs, use cases and a Handbook to guide users in selecting and interpreting the results. WP2, on the other hand, will provide guidance in terms of market value.

• The AI framework: WP5 will provide a Machine Learning framework that will serve to enable recommendations for the benchmarking community based on past experiences. It will be integrated with the Toolbox.

These elements of the Framework are described in more detail in the following sections.

2.2 The DataBench Toolbox

The DataBench Toolbox is the core technical platform to be provided by DataBench to users of the big data benchmarking community. The goal of the Toolbox is to provide tooling support to big data benchmarking users to, on the one hand, search, select, deploy and run existing big data benchmarks, and on the other hand, get the results of the execution, homogenize the technical metrics, and finally help to derive business insights and KPIs. It is therefore a tool based on existing efforts in big data benchmarking, taking advantage of years of research and practice accumulated in the community, rather than a proposal for new benchmarks. The Toolbox will offer ways to declare new benchmarks and provide a set of automatisms and recommendations that allow these tools to become part of the ecosystem. These benchmarks will be made available to the users from the Toolbox user interface, along with the possibility to get the results of the benchmarking execution runs and homogenize them into a common model in order to obtain a homogenized set of metrics. The degree of integration of the Toolbox with the existing benchmarks will vary depending on the integration issues that may arise.

From the usage perspective of benchmarking tools, DataBench has detected two very different scenarios. On the one hand, there are organizations that do not have any specific privacy or security constraints to run big data benchmarks, and therefore the setting of the benchmark may happen on public infrastructure. On the other hand, there are other organizations and industries that will only benchmark their big data environment in a secluded infrastructure, certainly not public and without any open Internet connection. This means that the Toolbox should cater for both scenarios and be able to enable the setting of the benchmarks completely offline if the user so decides.


Therefore, the Toolbox will offer two different deployment strategies, or Toolbox flavours. The first one, the sandbox deployment, will provide a way to create the necessary deployment recipes from the Toolbox server-side installation. This will allow users, for instance, to define where (i.e. IP and port numbers) the selected benchmarks should be deployed. This is well suited for deployments in public clouds or public infrastructure, where users do not impose very tight security constraints on running the benchmarks. On the other hand, the Toolbox will offer a second deployment option, the in-house deployment. This is more suited for users that would like to run the benchmarks in their own infrastructure, or where the system to benchmark resides. It usually involves higher security and privacy requirements. In this case, as many other benchmarking tools do, a version of the DataBench Toolbox will be downloaded after the selection of the benchmark to execute, which will help the user to set up the benchmark and, later, give back the results of the executions.

This feedback loop is one of the key aspects of the Toolbox, as the idea is that the results from the benchmark runs will be reported back to the Toolbox on the DataBench server side to homogenize the technical metrics and derive a set of business insights or KPIs. This process is explained in section 2.4, but it will be supported by the Toolbox and should be enabled for both deployment strategies.

The processes associated with the Toolbox are explained in detail in section 3, while the architecture of the Toolbox itself and the current potential implementation ideas are explained in section 4.
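To make the two flavours concrete, the deployment-recipe idea could be sketched as follows. This is a hypothetical illustration only: `DeploymentTarget`, `build_recipe` and the recipe fields are invented names for this sketch, not part of the actual Toolbox implementation.

```python
# Hypothetical sketch of the sandbox/in-house deployment idea: the user
# declares where a selected benchmark should be deployed (IP and port),
# and the server side turns that into a minimal deployment recipe.
# All names here are illustrative, not the actual Toolbox API.
from dataclasses import dataclass


@dataclass
class DeploymentTarget:
    host: str  # IP or hostname of the machine that will run the benchmark
    port: int  # port the benchmark harness should listen on


def build_recipe(benchmark: str, mode: str, targets: list) -> dict:
    """Return a minimal deployment recipe for 'sandbox' or 'in-house' mode."""
    if mode not in ("sandbox", "in-house"):
        raise ValueError("mode must be 'sandbox' or 'in-house'")
    return {
        "benchmark": benchmark,
        "mode": mode,
        # In-house deployments may run fully offline: results are reported
        # back later rather than streamed to the DataBench server side.
        "report_results_online": mode == "sandbox",
        "targets": [{"host": t.host, "port": t.port} for t in targets],
    }


recipe = build_recipe("HiBench", "sandbox", [DeploymentTarget("10.0.0.5", 8080)])
```

The key design point the sketch captures is that both flavours share one recipe format, and only the reporting behaviour differs between them.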

2.3 Benchmarks integration

An essential part of the DataBench project is to identify and classify all existing Big Data benchmarks and their communities. This work is in progress as part of WP1 and will be reported in full in the scope of that Work Package. In this section, the idea is to highlight the current status of the analysis of the major benchmarking efforts that DataBench is doing, with the aim of identifying the candidate benchmarks to be integrated into the Toolbox. Figure 2 is a current snapshot of a matrix positioning the Big Data benchmarks according to different criteria defined in the BDVA Reference Model (see section 4.2 for more details) [BDVSRIA]. On the left (in blue) are listed the different industry application domains, data types and technology areas. At the bottom (in green), all the different Big Data benchmarks are listed in release time order.


Figure 2. Matrix with Big Data Benchmarks


The six selected benchmarks are divided into two categories: micro-benchmarks and application benchmarks. The micro-benchmarks stress test very specific functionalities of the Big Data technologies, typically utilising synthetic data. The application benchmarks cover a wider and more complex set of functionalities, involving multiple Big Data technologies and typically utilising real scenario data. From a practitioner perspective, the two categories of benchmarks are quite different. The micro-benchmarks are easy to set up and use, as they typically involve one technology, whereas the application benchmarks require more time to set up and involve more technologies in the execution phase.

Figure 3. Detailed Classification of the Associated Benchmarks

Figure 3 summarises the associated micro and application benchmarks that we are evaluating. They cover multiple domains and data types, as listed in the last two table columns. HiBench [Huang et al., 2010], [HiBench Suite, 2018] is developed by Intel and is one of the very first Big Data benchmarks. SparkBench [Agrawal et al., 2015], [SparkBench, 2018] is developed by IBM and focuses on improving the Spark framework performance. YCSB (Yahoo! Cloud Serving Benchmark) [Cooper et al., 2010], [YCSB, 2018] is developed by Yahoo and targets cloud OLTP workloads typical for NoSQL systems. TPCx-IoT [TPCx-IoT, 2018] is the latest standardised benchmark by the TPC (Transaction Processing Performance Council), inspired by the YCSB benchmark but focusing on IoT Gateway systems. The Yahoo Streaming Benchmark (YSB) [Chintapalli et al., 2016], [YSB, 2018] is a simple advertisement analytics pipeline application benchmark developed by Yahoo. BigBench [Ghazal et al., 2013], [BigBench, 2018] is an end-to-end, technology agnostic, application-level analytics benchmark standardised as TPCx-BB by the TPC. BigBench V2 [Ghazal et al., 2017] addresses some of the limitations of the BigBench (TPCx-BB) benchmark. ABench [Ivanov and Singhal, 2018] is a Big Data Architecture Stack benchmark based on BigBench, aiming to cover multiple Big Data application scenarios. It is important to outline that implementations of all selected benchmarks are available for download and use, which will allow us to integrate and offer them as part of the DataBench Toolbox.
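The integration of such heterogeneous benchmarks relies on the homogenization idea described earlier: each benchmark's native results are mapped into a common metric model. A minimal sketch of that adapter pattern follows. The raw field names and all function names are assumptions for illustration, not the actual Toolbox code or the benchmarks' exact output formats.

```python
# Illustrative sketch (not actual Toolbox code) of the adapter idea behind
# results homogenization: each integrated benchmark registers a small parser
# that maps its native result fields into a common metric model.
# The raw input field names below are assumptions for this example.

def parse_ycsb(raw: dict) -> dict:
    """Map YCSB-style output fields into the common model."""
    return {
        "benchmark": "YCSB",
        "throughput_ops_s": float(raw["Throughput(ops/sec)"]),
        "avg_latency_ms": float(raw["AverageLatency(us)"]) / 1000.0,
    }


def parse_ysb(raw: dict) -> dict:
    """Map Yahoo Streaming Benchmark-style output into the common model."""
    return {
        "benchmark": "YSB",
        "throughput_ops_s": float(raw["events"]) / float(raw["duration_s"]),
        "avg_latency_ms": float(raw["avg_latency_ms"]),
    }


# Registry of adapters: integrating a new benchmark into the common
# framework then amounts to adding one entry here.
ADAPTERS = {"YCSB": parse_ycsb, "YSB": parse_ysb}


def homogenize(benchmark: str, raw: dict) -> dict:
    """Dispatch a raw result to the adapter registered for that benchmark."""
    return ADAPTERS[benchmark](raw)
```

The registry keeps the degree of integration flexible, as the text requires: a loosely integrated benchmark only needs a parser for its result files, while a tightly integrated one could also hook into deployment and execution.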


2.4 Business KPI derivation process

One of the major outcomes of DataBench is related to the possibility of deriving business KPIs from the results of technical benchmarks. However, this is not a clear process and there are very few attempts in the benchmarking community in this regard, with the exception, perhaps, of the TPC [1], which gives the cost of infrastructure and energy consumption as part of its results. There is no general model to map business KPIs to technical metrics, and there is no consensus on whether a general model could be achieved without taking into account a vast amount of context knowledge. The main issue in metrics/KPIs management is therefore the mapping between technical and business KPIs. This business metrics derivation process is under consideration at the time of writing this document. Deriving business KPIs from technical metrics could be very specific to particular scenarios and industries, and even subject to many external variables.

As an example, a recommendation system for the retail industry has a potential business impact on cumulative margin. From a technical standpoint, it is data intensive and requires benchmarks assessing query throughput and response time. Technical costs can be significant, and benchmark-driven technical choices can play an important role in increasing cumulative margin. In this case, the link between technical benchmarks and business KPIs is direct, as throughput can be translated into a cost-per-recommendation and, hence, into a margin-per-recommendation.
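The retail example can be made concrete with a small worked calculation. All figures and function names below are invented for illustration only; they sketch how benchmarked throughput might be translated into cost-per-recommendation and then into margin-per-recommendation.

```python
# Worked version of the retail example (all figures are invented for
# illustration): translating benchmark throughput into a
# cost-per-recommendation and a margin-per-recommendation.

def cost_per_recommendation(infra_cost_per_hour: float,
                            throughput_recs_per_s: float) -> float:
    """Infrastructure cost attributed to a single recommendation."""
    recs_per_hour = throughput_recs_per_s * 3600
    return infra_cost_per_hour / recs_per_hour


def margin_per_recommendation(per_transaction_margin: float,
                              conversion_rate: float,
                              cost_per_rec: float) -> float:
    """Expected margin of one recommendation, net of its technical cost."""
    return per_transaction_margin * conversion_rate - cost_per_rec


# Example: 3.60 EUR/hour of infrastructure, a benchmarked throughput of
# 100 recommendations/s, a 5 EUR per-transaction margin, and 2% of
# recommendations converting into a sale.
cost = cost_per_recommendation(infra_cost_per_hour=3.6,
                               throughput_recs_per_s=100.0)
margin = margin_per_recommendation(per_transaction_margin=5.0,
                                   conversion_rate=0.02,
                                   cost_per_rec=cost)
```

Comparing `margin` across alternative technical choices (each with its own benchmarked throughput and infrastructure cost) is exactly the kind of calculation the envisaged KPI derivation framework could support.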
The business KPI derivation process of the DataBench Toolbox can provide a framework allowing companies to input information on their standard per-transaction margin and to calculate business KPIs with alternative technical choices.

As another example, in predictive manufacturing, streaming data from sensors monitoring a production plant are analysed to improve product quality in several ways, such as identifying anomalies that could generate defects and, thus, reducing scrap and rework at the same time. In this case, quality checks have to be performed in real time, consistently with the schedule of the production process. Technical benchmarking tools can help the correct design of the architecture managing streaming data, for example by comparing different technical alternatives, depending on the real-time requirements of a company's specific production plant. In this case, the right technical choice enables business returns; however, the link between technical costs and returns is not direct, but mediated by the complexity of production processes and by the specific features of a plant. In addition to this, a reduction in scrap and rework can be difficult to translate into economic business returns and, as a matter of fact, companies may decide on the technical investment without measuring business returns. In this case, the business KPI derivation process of the DataBench Toolbox can provide a framework to understand the architectural alternatives and focus benchmarking on the most important features of the architecture, while the estimate of business returns may be more effectively aided by providing methodological guidelines (maybe with an online hypertextual/interactive/multimedia access to the WP4 Handbook).

This link between technical benchmarks and business KPIs therefore strongly depends on the use case and will be managed with a bottom-up approach. The initial idea is to enable derivation procedures based on very specific scenarios, technical metrics and their relation to a few general business KPIs. If possible, the system will learn from past and new experiences to try to generalise the derivation process using a machine learning approach. However, in many cases, the actual interpretation from the business perspective could not be implemented in a simple algorithm or even derived from past learning experiences. In those cases, the interpretation will be left to the business user after careful visualization of technical results along with other material, such as business reports from WP2. The research work that will be conducted in WP1 will provide a reference framework for understanding the relationship between business KPIs and technical benchmarks. The in-depth case study research conducted in WP4 will discuss different blueprints in detail, showing how the relationship between business KPIs and technical benchmarks is obtained in real cases and providing suggestions on how this knowledge can be embedded in the business KPI derivation process of the DataBench Toolbox to help both technical and business users.

In the initial phases of the DataBench project, WP1 will focus on the design of an integrated DataBench Framework, which, based on a continuous exchange of ideas and results, will be applied in WP4 to the desk analysis and to the in-depth case studies. This research work will result in a systematic analysis of the relationship between technical benchmarks and business KPIs, providing a classification of use cases as well as blueprints of how users can be guided in the selection of the most appropriate benchmarks and in relating benchmarking results to business returns. Figure 4 shows a preliminary set of blueprints tying business KPIs with technical benchmarks, used by the DataBench team as a basis for discussion.

[1] http://www.tpc.org/

Figure 4. Preliminary set of blueprints tying business KPIs with technical benchmarks


By way of example, the following business KPIs give a hint of the difficulty of mapping technical results to business impact in a straightforward way: i) cost reduction (it is closely tied to the platform itself and therefore difficult to generalise); ii) time efficiency (it may be easy to measure the gain in time for some processes, but it is not so clear how this relates to the business objectives except in a few selected scenarios); iii) product/service quality (among others, scalability or fault tolerance are technical metrics that many benchmarks usually provide, but it is not clear how to relate them to quality, and they are measured in different ways depending on the benchmark); or iv) revenue growth (which is not really possible or simple to relate to a given system).

In summary, the derivation of business KPIs will involve a bottom-up, scenario- and sector-based approach, and a top-down, generic, Machine Learning-based approach. We hope that the combination of these derivation processes, where possible, with work on homogenisation, clustering and visualization of technical KPIs, plus the usage of other non-technical results of the project (such as market material and the Handbook provided by WP1, WP2 and WP4), will give clear business insights to the DataBench users.
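The scenario-based, bottom-up derivation discussed above can be illustrated with a toy calculation. The sketch below is a minimal illustration, not a DataBench implementation: it combines one technical benchmark metric (throughput) with two business inputs (per-transaction margin and infrastructure cost) into a simple business KPI for two hypothetical technical alternatives. All names and figures are invented.

```python
# Hypothetical sketch: deriving a business KPI (profit per hour) from a
# technical benchmark result plus company-supplied business inputs.
# All component names and numbers are illustrative, not DataBench outputs.

def profit_per_hour(throughput_tps, margin_per_tx, infra_cost_per_hour):
    """Business KPI derived from a technical metric (throughput, tx/s) and
    two business inputs (margin per transaction, infrastructure cost)."""
    revenue_per_hour = throughput_tps * 3600 * margin_per_tx
    return revenue_per_hour - infra_cost_per_hour

# Two technical alternatives as measured by a benchmark (invented figures).
alternatives = {
    "cluster_small": {"throughput_tps": 120.0, "cost_per_hour": 40.0},
    "cluster_large": {"throughput_tps": 300.0, "cost_per_hour": 160.0},
}

margin = 0.002  # EUR earned per processed transaction (company input)
kpis = {name: profit_per_hour(a["throughput_tps"], margin, a["cost_per_hour"])
        for name, a in alternatives.items()}
best = max(kpis, key=kpis.get)  # alternative with the highest business KPI
```

Even this trivial example shows why the derivation is use-case dependent: the same technical results yield different business conclusions as soon as the margin or cost assumptions change.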

2.5 The Machine Learning framework

The purpose of the Machine Learning (ML) framework is to learn relationships between technical decisions and business outcomes for big data projects. This framework will recommend optimal, resource-efficient, objective-maximising technical infrastructure to achieve the big data-related goals of the organization.

Briefly, previously benchmarked technical projects are described by a variety of technical and business indicators ("features"), allowing statistical analysis and modelling of the most successful approaches to solving particular business-oriented big data problems. Initially, models will be trained and deployed based on a wide variety of openly available data repositories (the "bootstrapping" phase). As use cases and technical survey results become available, metrics will be updated and iteratively incorporated into the ML modelling infrastructure, allowing the automated algorithms to provide increasingly relevant recommendations on a wide variety of technical big data problem domains.

To get an initial set of annotated data, the project will investigate and mine different repositories, such as the one provided by the Aloja2 benchmarking framework, among others. That is, the Machine Learning framework receives annotated data and technical infrastructure decisions, associated with business KPIs. Users may then query the system according to the type of project/goal and their business indicators of interest, and the system provides output recommending an optimal strategy for approaching the problem.

In general, there is an expectation that the Machine Learning framework will receive feedback on its recommendations from end users, in the form of technical indicator evaluation metrics.

As explained before, another potential application of the Machine Learning framework is related to the generalisation of a model to derive business KPIs. The problem here is the lack of data to train the model. We hope that in the course of the project, as initial

2 https://aloja.bsc.es/


experiments provide examples of how business metrics can be derived from technical ones, this growing corpus will allow such an endeavour.
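To make the "bootstrapping" idea above more tangible, the sketch below shows the smallest possible version of such a recommender: past benchmarked projects are feature vectors, and a new query is matched to the most similar past project, whose infrastructure is recommended. This is a hypothetical nearest-neighbour toy, not the project's actual model; the records, feature names and infrastructure labels are invented, and a real implementation would use a proper ML stack.

```python
# Minimal nearest-neighbour sketch of the ML framework's bootstrap idea:
# past projects are feature vectors (technical + business indicators) and a
# query is matched to the closest one. All records are invented examples.
import math

past_projects = [
    # features: (data_volume_tb, realtime_need 0-1, budget_score 0-1)
    {"features": (5.0, 0.9, 0.3), "infrastructure": "streaming-cluster"},
    {"features": (80.0, 0.1, 0.8), "infrastructure": "batch-hadoop"},
    {"features": (2.0, 0.2, 0.2), "infrastructure": "single-node-spark"},
]

def recommend(query):
    """Return the infrastructure of the past project closest to `query`."""
    closest = min(past_projects,
                  key=lambda p: math.dist(query, p["features"]))
    return closest["infrastructure"]

rec = recommend((60.0, 0.2, 0.7))  # a new, large, batch-oriented project
```

As the annotated corpus grows (Aloja-style repositories, survey results, user feedback), the same query interface could be backed by progressively richer models.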


3. DataBench processes

There are different ways of representing the functionality of software before it is finished; one of these is the Unified Modelling Language (UML), the best-known and most widely used software modelling system today. It is composed of various graphic elements that combine to form diagrams, and it is maintained by the Object Management Group (OMG), which is dedicated to the stewardship and establishment of standards for object-oriented technologies. Within UML there are different types of diagrams that represent the different perspectives of a system, a model being a simplified representation of reality. Use Case diagrams represent what the system will do, but not how it works.

This section provides the UML Use Case diagrams, their components and a brief explanation of the expected processes to be carried out within the entire DataBench toolset (Toolbox, AI, etc.). It is therefore a high-level, step-by-step description of what DataBench could do for the target users.

3.1 DataBench actors

To allow a correct representation of the model, the first step consists of the identification of the users of the system, known as actors in UML. The actors involved in the DataBench processes are the following:

• DataBench Admin, who adds existing or new benchmarks to the Toolbox and manages the server-side system.

• Technical Users, who can be developers, administrators or testers who would like to perform a technical benchmark of a given system/application. These are the actual users of the final system.

• Business Users. As in the previous case, these are end users who would like to benchmark their system or application. However, in this case they do not perform the actual benchmark; rather, they are interested in understanding the underlying business implications. These users will instead use the web tool to search for past executions, getting recommendations, business indicators, etc.

• Benchmark Providers. Typically, developers or providers of big data benchmarks (e.g. the people behind HiBench). They will be able to add or update benchmarks in the Toolbox.

3.2 DataBench UML Use Cases

Considering that a "use case" is a description of the actions of a system from the user's point of view, this subsection lists the different use cases identified in the context of the DataBench project, taking as a reference what is specified in the DoA. The following subsections describe the DataBench use case diagrams that aim to model the functionality of the DataBench Toolbox using actors and use case diagrams. The use cases will become services or functions that will be provided by DataBench to the target users.

3.2.1 DataBench Accessing use case

The DataBench accessing use case diagram shown in Figure 5 is in charge of managing the access of users to the DataBench platform.


Figure 5. Accessing use case diagram

The processes or functionalities identified are the following:

• Define User Profile: This process allows the creation of a user profile following the typology of the actors of the system: Administrator, Technical User, Business User or Benchmark Provider.

• Grant Permissions: This process will allow the DataBench Administrator to grant a set of permissions to a certain user profile.

• Create Account: This process establishes the mechanism for creating an account and associating it with any kind of user profile.

• Access to DataBench Toolbox: This functionality will support access to the system. Access to some parts of the system will be private and will include a "Login" process. For other, public parts of the system, it is expected that login will not be necessary, therefore allowing guest users.

• Get User Profiles: This process aims to get the different types of user profiles available in the system.

3.2.2 DataBench KPIs management use case

The DataBench use case diagram related to technical and business KPIs management is shown in Figure 6.


Figure 6. Technical and business KPIs management use case diagram

The processes or functionalities identified can be classified in three main groups from a business point of view:

• Technical KPIs management: This group of functionalities involves the processes of setting and managing the technical KPIs used by the existing benchmarks in a common DataBench KPI framework. It will provide functionalities for i) adding a new technical KPI; ii) updating an existing technical KPI; iii) listing the technical KPIs associated with a concrete benchmark; iv) getting an individual technical KPI; and finally v) deleting a technical KPI. The Technical User is the actor that performs these functionalities. In addition, a harmonisation of the technical results is carried out as part of the setup and runtime use case (see subsection 3.2.3) to resolve the mappings between the technical KPIs from a given benchmark and the similar KPIs in DataBench.

• Business KPIs management: This group of functionalities involves the processes of i) deriving a new business KPI using the associated derivation procedure; ii) adding a new business KPI; iii) updating an existing business KPI; iv) listing the business KPIs associated with a concrete benchmark; v) getting an individual business KPI and,


finally, vi) deleting a business KPI. The Business User is the actor involved in these scenarios.

• KPIs derivation procedures management: This group of processes covers i) the definition of a new derivation procedure; ii) the update of an existing derivation procedure; and finally iii) the deletion of a derivation procedure. As in the case of business KPIs management, the Business User is the actor involved in these scenarios.

3.2.3 DataBench User intentions and Set up/Runtime use cases

Figure 7 shows a combined use case diagram that reflects the relationship between the User intentions and Set up/Runtime use cases.

Figure 7. User intentions and Set up/Runtime use cases

By User intentions, this process means the identification and categorization of the most suitable benchmark that fits the searching criteria specified by the Technical User. These criteria refer to the Big Data horizontal area, vertical domain and data type, and will be extended with a set of goal metrics related to the purpose of the benchmark.


On the one hand, the User intentions use case aims at defining the processes that help the end users of the DataBench Toolbox to search for and select the right benchmarks that suit their needs. The use case includes the processes for i) user intentions' specification and ii) DataBench benchmark matching based on the user intentions. To specify the user intentions, the specification process could be extended by performing some queries over the benchmark catalogue in order to get the associated metadata to support the formalization of the intentions. This process may also take advantage of the work done in WP5 using Artificial Intelligence (AI) techniques to deliver recommendations based on specific features.

On the other hand, the Setup/Runtime use case encompasses the processes of i) setting up and configuring a DataBench-associated benchmark; ii) executing a concrete benchmark; iii) injecting the results of the executed benchmark, either automatically or manually by means of uploading a set of result files; and finally iv) harmonizing the technical results into a DataBench format. The execution process could be extended using two different deployment processes:

1. Sandbox deployment: This approach will help users deploy the DataBench benchmark and their applications into a testing environment outside their own organization. Following this approach, a Technical User should communicate the necessary information to the DataBench platform to let the Toolbox know the details of the infrastructure where the deployment will be done (i.e. servers, credentials, etc.). This is useful, for instance, for users that would like to benchmark something on cloud resources (Amazon, Azure, Google, etc.). The user should declare to the Toolbox the hosts and infrastructure where the deployment should be done on the server side.

2. In-house deployment: This process will provide the means to download the Toolbox as an executable client-side framework to the infrastructure of the user (i.e. their production or testing in-house infrastructure, a private cloud, etc.). This can be done, for instance, by preparing an Ansible3 playbook and using a client user interface (i.e. a dashboard, wizard, or scripts) to set it up in the local resources, execute the playbook and allow running several benchmark rounds.

In both cases, the Toolbox will provide the means to get the results of the benchmark runs back to the Toolbox. The user may decide which results to upload to DataBench, but the idea is to enable and recommend the automatic collection of results as the preferred option. This is important in order to access more advanced features, such as the publication of results on the DataBench platform, and to access more information about potential business KPIs.

3.2.4 DataBench Visualization and Reporting use case

The DataBench visualization and reporting use case diagram shown in Figure 8 is in charge of covering the visualization of the benchmark execution results and derived business metrics.

3 https://www.ansible.com/


Figure 8. Visualization and Reporting use case diagram

This use case relies on the communication of the results of the execution of the benchmarks explained in the previous use case. The Technical User will be able to visualize the technical results, while the Business User will be able to access the potential derived business metrics. In both processes the Technical User and the Business User can look for other public results and define filters based on similarity or comparative criteria in the process of querying the results.

3.2.5 DataBench Benchmark Management use case

This use case provides an overview of the different processes related to the management of the benchmarks and metrics handled in DataBench. As such, it offers processes for adding, updating or deleting benchmarks in the Toolbox, along with the necessary metadata to describe or deploy them. It also enables the population of the catalogues of technical and business KPIs and the potential definition of derivation procedures. In summary, this use case is fundamental to providing extensibility to the Toolbox. The use case diagram related to Benchmarks management is shown in Figure 9.


Figure 9. DataBench Benchmark management use case diagram

The processes or functionalities identified are the following:

• Add new Benchmark: This process is in charge of uploading to the DataBench platform all the necessary resources for a subsequent execution of the benchmark. These resources include:

o Benchmark sources: These sources include i) Data Generator, ii) Workloads and iii) Results Injector.

o Benchmark configuration
o Deployment Recipes


• Update a Benchmark: This process allows updating the sources, configuration and/or deployment recipes associated with a concrete benchmark.

• Delete a Benchmark: This functionality gives the DataBench Admin the possibility to delete an existing benchmark and the associated resources from the DataBench platform.

• Initialize Data/metadata catalogues: This process enables the DataBench Admin to initialize the different stores or catalogues defined in the DataBench platform: the Technical KPIs catalogue, Business KPIs catalogue, Deployment Recipes catalogue, Benchmark catalogue and Users catalogue.

• Search in Data/metadata catalogues: This functionality lets the Benchmark Provider and DataBench Admin perform queries to get the sources, configuration or deployment recipes of a registered benchmark.

3.3 DataBench components

UML diagrams are classified according to their structure or behaviour; the latter refers to the way in which instructions are executed or how activities take place within the system. The structure diagrams define how the software is structured. An example of this is the component diagram, which defines the different parts that make up the system. This section provides the DataBench component diagram showing the different pieces and subsystems, which are explained in the following sub-sections. The global DataBench component diagram is shown in Figure 10.


Figure 10. DataBench components diagram


3.3.1 Accessing Subsystem

The accessing subsystem shown in Figure 11 will be in charge of authentication, access authorization and auditing. To support these functionalities, the accessing subsystem is composed of five software components:

• User Account Manager: This component will allow the DataBench Admin, DataBench User or DataBench Provider to carry out one of the following actions: i) create an account; ii) update the information associated with an account; iii) get all the information associated with a concrete account; or finally iv) delete an account.

• Access Controller: This asset is in charge of allowing or denying access to the Toolbox for any of the actors defined in section 3.1. The controller will communicate with the User Account Manager to obtain the information associated with the user profile regarding grants.

• Grant Permissions Manager: This component will let the DataBench Admin grant a set of permissions to a user profile. Initially, we have in mind three types of permissions over the actions and assets that will be defined in the Toolbox: read, write and execute.

• User Profile Manager: This asset will allow the DataBench Admin to perform the CRUD (Create, Read, Update, and Delete) operations for managing the user profiles in the Toolbox.

• User/Grants Catalogue: This asset is an internal storage in which the data related to user accounts, user profiles and the grants associated with the profiles will be stored.

Figure 11. Accessing subsystem

The access control approach defined for DataBench covers access approval: the system decides whether to grant or reject an access request from an already authenticated actor, based on what the actor is authorized to access. The authorized actions will be represented by means of the user profiles and the grants associated with them.

Finally, in DataBench authentication and access control will be combined in a single operation, and access will be approved based on successful authentication or, in the case of guest users, on anonymous access.
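The combined authentication/authorization step described above can be sketched as follows. The profile names come from section 3.1 and the read/write/execute grant types from the Grant Permissions Manager description; the concrete grant assignments per profile are invented assumptions, not the Toolbox's actual policy.

```python
# Hypothetical sketch of the combined authentication / access-control check:
# each profile carries a set of grants, and an unauthenticated user falls
# back to an anonymous read-only Guest profile. The grant assignments below
# are illustrative assumptions only.

PROFILE_GRANTS = {
    "DataBenchAdmin":    {"read", "write", "execute"},
    "TechnicalUser":     {"read", "execute"},
    "BusinessUser":      {"read"},
    "BenchmarkProvider": {"read", "write"},
    "Guest":             {"read"},   # anonymous access to public parts
}

def authorize(profile, action):
    """Grant or deny an action for an (already authenticated) profile.
    A None profile is treated as an anonymous Guest; unknown profiles
    receive no grants at all."""
    grants = PROFILE_GRANTS.get(profile or "Guest", set())
    return action in grants
```

The fallback to Guest mirrors the remark above that access is approved either on successful authentication or on anonymous access to public parts.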


3.3.2 User Intentions Subsystem

The User Intentions subsystem shown in Figure 12 will be in charge of enabling users to specify their requirements related to the usage of a big data benchmark. To support these functionalities, the User Intentions subsystem is composed of two software components:

• Requirements Modeler: This component will allow technical users to define a set of initial requirements in order to select the most appropriate benchmark to suit their needs. Initially, the intentions of the user will be aligned with the different layers of the BDVA Reference Model specified in the BDVA Strategic Research & Innovation Agenda (SRIA)4. This means that the intentions should specify the concrete Big Data horizontal area, vertical domain and data type. This initial set will be extended and aligned with the results of WP5 in terms of a recommendation engine based on Machine Learning, and with a set of goal metrics related to the purpose of the benchmark.

• Benchmark Matcher: It will be in charge of covering the matching process between the user requirements and the benchmarks registered in the Benchmarks catalogue (see section 3.3.5). The Matcher will provide a set of benchmarks that fit the User Intentions specified.

Figure 12. User Intentions subsystem

The selection of the benchmark will be done by the Technical User once the candidate benchmarks have been listed by means of the user interface that will be provided in the Toolbox.

Finally, a decision tree or other Machine Learning approaches will be explored as the technical implementation of the Benchmark Matcher. For instance, in the case of decision trees, the variables proposed for modelling the user requirements coming from the BDVA Reference Model could be used as decision nodes.
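The decision-node idea can be pictured with a hand-rolled routing function over BDVA-style dimensions. This is a deliberately simplified illustration: the routing rules and the benchmark candidates per branch are invented placeholders (hence the "-like" suffixes), and an actual implementation might instead train a decision tree over the catalogue metadata.

```python
# Illustrative hand-rolled "decision tree" for the Benchmark Matcher:
# user intentions along BDVA-style dimensions (horizontal area, data type)
# are routed to candidate benchmarks. Rules and candidates are invented.

def match_benchmarks(horizontal_area, data_type):
    """Return candidate benchmarks for the given user intentions."""
    if horizontal_area == "data processing":        # decision node 1
        if data_type == "streaming":                # decision node 2
            return ["StreamBench-like"]
        return ["HiBench-like", "BigBench-like"]
    if horizontal_area == "data storage":
        return ["YCSB-like"]
    return []  # no candidate found; fall back to a catalogue search
```

With real catalogue metadata, the same dimensions would become the feature columns of a trained classifier rather than hard-coded branches.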

3.3.3 Setup and Runtime Subsystem

The Setup and Runtime subsystem shown in Figure 13 will be in charge of supporting the overall workflow associated with the deployment and execution of one of the benchmarks registered in the Toolbox. The workflow will include three main steps, namely i) setup and configuration; ii) deployment and execution; and finally iii) injection of the results into the

4 http://bdva.eu/sites/default/files/BDVA_SRIA_v4_Ed1.1.pdf


Toolbox. To support these functionalities, the Setup and Runtime subsystem is composed of four software components:

• Configuration Manager: This component will allow a Technical User to upload a set of configuration parameters associated with the benchmark that they aim to execute.

• Benchmark Orchestrator: This is the orchestration engine in charge of deploying and executing the benchmark, taking the configuration parameters defined by the Technical User. The execution of the benchmark could be performed automatically or, on the contrary, fired manually by the Technical User from the deployed benchmark environment.

• Benchmark Deployer: This asset is in charge of the physical provision of the execution environment requested by the Benchmark Orchestrator as part of a benchmark execution.

• Results Injector: This component is in charge of injecting the results of the benchmark runs into the Toolbox, either fired manually by the user or automatically as the last step of the benchmark execution workflow. Each benchmark in DataBench should have its own instance of this module, tailored to get the results from the specific format of the benchmark into DataBench.

Figure 13. Setup and runtime subsystem

It is important to mention that DataBench does not provide infrastructure for executing the benchmarks, but rather a centralized platform from where the Toolbox can be accessed via a web user interface, enabling two potential scenarios to cover the deployment and execution steps after the selection of a given benchmark:

1. Sandbox deployment (execution on public resources): This scenario is useful for users and organizations that would like to benchmark a big data solution on public resources or infrastructure (i.e. a public cloud) with no major security or privacy constraints. In this case, the Technical User should upload some details about the execution environment to the DataBench Toolbox through the web user interface to let it know where the deployment will take place (i.e. the hosts, credentials, etc.).

2. In-house secure deployment and execution: In many cases, users and organizations would like to run the benchmark in their own infrastructure or in another secure infrastructure. Therefore, giving the details of their internal secure environment to the DataBench Toolbox in our main central deployment is not desirable or even


feasible for users. In this case, as is common in the majority of benchmarking frameworks, the DataBench Toolbox will offer the possibility to download a version of the Toolbox with the necessary functionality to deploy and run the benchmarks in the user infrastructure, and to eventually return the results of the benchmarking runs to the DataBench central platform. For instance, the Toolbox could be downloaded as an Ansible playbook, providing a client-side user interface to set the benchmark up in the local resources and run several benchmark rounds. The Toolbox will offer the means to get the results of the execution back to the server-side counterpart of the Toolbox, either manually or automatically (the preferred option).

Each benchmarking system provides different ways to retrieve results from the benchmarking runs, ranging from logs or other types of files to databases, plus even some visualization tools. DataBench does not aim at replacing them, but rather at getting the data into the DataBench Toolbox, harmonizing it and offering added-value services for comparison, KPI abstraction or business insights. In order to do that, the results from the execution rounds should be gathered. The initial step for each benchmark present in the DataBench Toolbox is therefore the creation of a connector to feed its results back. During the project lifetime, a set of connectors and transformation rules for some popular benchmarks will be provided, along with guidelines to extend these connectors for new benchmarks in an easy way.

As mentioned above, it is important to get the results, so we will foster the adoption of an automatic injection mechanism where possible. This might be implemented, for instance, by defining an Apache Kafka5 topic where the connector will publish the results, either automatically or manually. For each benchmark, a Kafka Producer will be provided to get the results in the appropriate format to the Kafka topic the DataBench Toolbox will be listening to. Other approaches for manual injection will also be explored.
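The injection idea above can be sketched as follows: the connector serialises one run into a harmonised JSON message and hands it to a Kafka producer. The field names, topic name and broker address below are illustrative assumptions; the kafka-python calls are shown as comments because they require a live broker.

```python
# Sketch of the result-injection payload a per-benchmark connector could
# publish to the Toolbox's Kafka topic. Field names, topic name and broker
# address are assumptions, not DataBench specifications.
import json
import time

def build_result_message(benchmark, run_id, metrics):
    """Serialise one benchmark run into a harmonised JSON message body."""
    return json.dumps({
        "benchmark": benchmark,
        "run_id": run_id,
        "timestamp": int(time.time()),
        "metrics": metrics,           # already-harmonised technical KPIs
    }, sort_keys=True).encode("utf-8")

msg = build_result_message("hibench-wordcount", "run-001",
                           {"throughput_mb_s": 512.4, "duration_s": 87.1})

# Publishing with kafka-python would then look roughly like:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="databench-host:9092")
# producer.send("databench.results", msg)   # hypothetical topic name
```

Keeping the message body a plain JSON document also leaves room for the manual-injection alternatives mentioned above (e.g. uploading the same document as a file).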

3.3.4 Analytics and KPI Management Subsystem

The Analytics and KPIs management subsystem shown in Figure 14 will be in charge of supporting the management of technical and business metrics and the associated derivation procedures. To support these functionalities, the KPIs management subsystem is composed of three software components:

• Metrics Manager: This component will allow Technical Users to perform the CRUD operations for managing the technical and business metrics in the Toolbox.

• Metrics Spawning: This component will take the technical metrics, analyse them and, where possible, spawn relevant business metrics from them.

• KPIs Catalogue: This component will provide the storage means for all the features related to technical and business metrics, as well as the associated derivation procedures.

5 https://kafka.apache.org/


Figure 14. KPIs management subsystem

As explained in section 2.4, the derivation of business KPIs from the results of technical benchmarks is not a straightforward process. It will be based mainly on a scenario-based, bottom-up approach, but with the aim of generalizing the procedures by making use of Machine Learning, visualization and other techniques, plus the usage of other non-technical results of the project (such as market material and the Handbook provided by WP2 and WP4).
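One way to picture how the Metrics Spawning component could apply catalogued derivation procedures is sketched below: each registered procedure names its required inputs and is applied only when a run's metrics contain them all. The procedure names, inputs and formulas are invented illustrations, not catalogue contents.

```python
# Toy sketch of the Metrics Spawning idea: registered derivation procedures
# are applied to a run's metrics wherever their inputs are available.
# Procedure names and formulas are invented for illustration.

DERIVATIONS = {
    # business KPI -> (required inputs, derivation formula)
    "time_saved_pct": (("old_duration_s", "duration_s"),
                       lambda old, new: 100.0 * (old - new) / old),
    "cost_per_run_eur": (("duration_s", "cost_per_hour_eur"),
                         lambda dur, rate: dur / 3600.0 * rate),
}

def spawn_business_metrics(metrics):
    """Derive every business KPI whose inputs are present in `metrics`."""
    derived = {}
    for kpi, (inputs, formula) in DERIVATIONS.items():
        if all(name in metrics for name in inputs):
            derived[kpi] = formula(*(metrics[name] for name in inputs))
    return derived

out = spawn_business_metrics({"duration_s": 1800.0,
                              "cost_per_hour_eur": 12.0,
                              "old_duration_s": 3600.0})
```

The "apply only when inputs are available" rule reflects the bottom-up approach: where a scenario does not supply the business inputs, no KPI is spawned and interpretation stays with the business user.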

3.3.5 Benchmark Management Subsystem

The Benchmark management subsystem shown in Figure 15 will be in charge of supporting the supply of data and metadata related to a benchmark not yet registered in the Toolbox.

Figure 15. Benchmark management subsystem

To support these functionalities, the Benchmark management subsystem is composed of six software components:

• DataBench Initializer: This component will allow the DataBench Admin to perform the initialization of the main storages defined in the Toolbox (User catalogue, Metrics/KPIs catalogue and Benchmarks catalogue).


• Benchmark Business Delegate: This component will be in charge of delegating the business actions of the benchmark sources, configuration and deployment recipes management to the appropriate business object that implements the functional logic.

• Benchmark Sources Business Delegate: This component is responsible for the management of the benchmark sources, which are composed of i) a data generator, ii) workloads and, finally, iii) a results injector.

• Benchmark Configuration Business Delegate: This component is responsible for the benchmark configuration management.

• Benchmark Deployment Recipes Business Delegate: This component is responsible for the benchmark deployment recipes. As a tentative approach, the recipes could be Ansible playbooks.

• Benchmark Catalogue: This component will support the storage of all the data and metadata related to the registered benchmarks. This information will be related to the benchmark sources, configuration and deployment recipes.

3.3.6 Visualization and Reporting Subsystem

The Visualization and Reporting subsystem shown in Figure 16 will be in charge of supporting searching over the technical and derived business metrics related to a benchmark execution.

Figure 16. Visualization and Reporting subsystem

To support these functionalities, the Visualization subsystem is composed of three software components:

• KPIs Search Engine: This component will allow searching on the results storage.
• Advanced Search Engine: This component will extend the search functionality of the search engine, providing mechanisms to perform advanced queries.
• Filtering Search Engine: In the same way as the Advanced Search, this component will extend the search engine with filtering search.
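A minimal picture of the filtering search over stored results: records are field/value documents and a filter matches on exact values. The stored records and field names below are invented examples, and a real implementation would sit on top of the results storage rather than an in-memory list.

```python
# Minimal sketch of filtering search over stored benchmark results.
# The records and field names are invented illustrations.

results_db = [
    {"benchmark": "hibench", "workload": "wordcount", "throughput": 512.4},
    {"benchmark": "hibench", "workload": "sort",      "throughput": 350.0},
    {"benchmark": "ycsb",    "workload": "workloada", "throughput": 9000.0},
]

def search(db, **filters):
    """Return results matching every exact-value filter,
    e.g. search(db, benchmark='hibench')."""
    return [r for r in db
            if all(r.get(k) == v for k, v in filters.items())]

hits = search(results_db, benchmark="hibench")
```

The Advanced Search Engine would then layer richer predicates (ranges, similarity, comparisons across runs) on the same query surface.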


4. DataBench Toolbox architecture

In the following sections we provide a high-level overview of the proposed architecture of the DataBench Toolbox.

First, in a technology-agnostic manner, we present a functional description of the main modules that will form the Toolbox, as well as their interconnections, pointing out the main functionality of each one.

After that, we relate the BDVA reference model to the described Toolbox architecture and identify the existing gaps.

Finally, we present a potential implementation of the functional architecture in a more technical way, describing in detail each of the modules with their possible implementation and the technologies behind them.

4.1 Functional overview of the DataBench Toolbox

One of the objectives of the DataBench project is to provide an easy-to-use big data benchmarking Toolbox that will give users the opportunity to gather metrics about their systems.

Figure 17. Functional overview of the framework architecture

To achieve this, we propose a modular framework based on templates, which will be complemented with a web interface from where the user can decide and choose the metrics


needed, and which will also act as a dashboard where the results of the executions will be gathered and shown to the user, as can be seen in Figure 17.

Our proposed architecture is composed of the following interconnected modules:

• Web interface;
• Benchmark framework;
• Results interface;
• Results parser;
• Metrics spawner;
• Results DB.

We will dive into the details of each module in the following sections.

4.1.1 Web interface

As explained before, the main point of access to the benchmark tool will be the web interface. This web interface, as shown in point 1 of Figure 17, will be connected to the back end of the Toolbox. The web interface module will provide the different users of the tool with the functionality to choose which benchmarks they want to run. It will also be in charge of providing a configuration layer that the users can fill in to preconfigure the templates and the benchmarks to be run later on.

As explained in section 3.2.3, the web interface will support two different scenarios:

- First, it will be in charge of providing a configuration layer that the users can fill in to preconfigure the templates and the benchmarks, so they can be run from the server on the configured instances.

- Second, it will act as an entry point from where the user can download the Toolbox for running it in-house, with the user in charge of managing all the installation and configuration of the Toolbox.

Moreover, the web module will be used to show in a dashboard the results of the executions and the derived metrics and business insights, as an added value for those organizations that close the feedback loop by giving the results of their benchmark runs back to the Toolbox.

4.1.2 Benchmarking framework

The web interface will be connected to the back end of the Toolbox, called the Benchmarking Framework in point 2 of Figure 17, where the templates and recipes to run the benchmarks will be stored. The idea of this module is to act as a repository of recipes that will later be used to generate a template provided to the Technical User, for them to configure according to the system where they want to run the benchmarks. This module will be the main point of interaction between the administrator and the Benchmarking Framework, since the administrator will be in charge of handling the integration, addition and deletion of new, updated or modified benchmarks.

This module can also offer the functionality to run some of the required benchmarks on a cloud provider, given the credentials and the correct configuration. This would allow users without a physical cluster to run the different benchmarks on the cloud, and to decide on the configuration of their infrastructure before provisioning it.


Moreover, as stated in the previous sections, the functionality of the framework will also be available for download, letting the user run it on the client side and leaving the installation and configuration of the framework to the Technical User.

4.1.3 Results interface

The results of the runs will later be transferred back to the Benchmarking Framework through the results interface, point 3 in Figure 17, on the server side, to be parsed into the defined technical data model. The project will provide this interface so that results can be transferred to the framework either automatically by the benchmark run or manually by the user.

This interface should be the interconnection point for any benchmark result added to the tool. In this way, we will have a single point of connection for the result parsers to read the data from, simplifying the whole architecture of the Framework.

4.1.4 Results parser

Since each benchmark returns its results in a different way, we need to define a module capable of converting these heterogeneous results into a standardized data model. This will give the Framework a consistent set of data across as many benchmarks as possible, making the calculation of the business metrics in the next steps as easy as possible.

The result parser module is going to be closely related to the results interface and will work in collaboration with it. It will be open enough to parse any data that can be gathered through the results interface.

Moreover, these result parser modules need to be easily integrated into the platform, to allow any benchmark provider to generate its own parser. For this task we will generate and document the interfaces connecting this module with the results interface and the results DB in an open way, so that we do not constrain providers to any specific technology.
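As an illustration of such a parser, the following minimal sketch maps one line of a HiBench-style report into a hypothetical common data model; the field names of `BenchmarkResult` and the exact report layout are illustrative assumptions, not the final DataBench technical data model:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    """Hypothetical record of the standardized technical data model."""
    benchmark: str
    workload: str
    metric: str
    value: float
    unit: str

def parse_hibench_report(line: str) -> BenchmarkResult:
    """Parse one report line of the form
    'Type Date Time Input_data_size Duration(s) Throughput(bytes/s) ...'
    into the common model. The field layout is illustrative."""
    fields = line.split()
    return BenchmarkResult(
        benchmark="HiBench",
        workload=fields[0],
        metric="throughput",
        value=float(fields[5]),
        unit="bytes/s",
    )

result = parse_hibench_report(
    "ScalaSparkWordcount 2018-10-01 12:00:00 3280 25.1 130678 130678")
print(result)
```

Each benchmark integrated in the Toolbox would contribute one such function, all emitting the same record type, which is what makes the downstream KPI calculation uniform.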

4.1.5 Metrics Spawning

In order to derive the requested KPIs from the technical results of the benchmarks, we will need another dedicated module in the Architecture.

This module will be mainly connected to the results DB module, so it can read the corresponding results from the technical data model, calculate the defined KPIs and, at the end, write them back to the results DB.

As hinted in section 3.2.2, the process of deriving business metrics is still under discussion at the time of writing this document. In our current view, business metrics depend not only on the technical metrics gathered in the Results Interface module, but also on the context. The project will therefore follow a bottom-up approach by studying specific business metrics for clear business scenarios. We foresee that some specific metrics could be calculated, but most of them will require the assessment of the business user. Therefore, the Metrics Spawning component will in many cases combine different methods, ranging from specific algorithms to machine learning approaches (learned from previous or new cases), technical metrics homogenization, comparison and visualization techniques, as well as the documentation around big data market trends and the DataBench Handbook to be delivered in the scope of other Work Packages.


In the same way as the result parsers, since new data can be added to the Toolbox, this module needs to be open to extension with new KPIs and calculations.

4.1.6 Results DB

The framework will need to provide a place where the Result Parsers can store their data, and from where the web interface can read the results to show them in the dashboard. This will be the results DB module.

This module needs to be complemented with at least a defined technical data model and an interface, so that the rest of the modules can make use of it. The technical data model needs to be general and open enough to fit all the results of the benchmarks, as well as to allow easy extension to house any new benchmark added to the platform.

4.2 Relation to the BDVA Reference Model

Based on our evaluation and first-hand experience with some of the benchmarks, we selected six initial benchmarks for extensive testing and integration in the DataBench Toolbox. Figure 18 depicts them as part of the different BDVA Reference Model layers that they cover. Clearly, one of the key requirements is to achieve a wide coverage of the Big Data technology layers in the BDVA architecture.

Figure 18. Big Data Value Reference Model with the Associated Benchmarks

The BDV Reference Model has been developed by the BDVA, taking into account input from technical experts and stakeholders along the whole Big Data Value chain, as well as interactions with other related PPPs. An explicit aim of the BDV Reference Model in the SRIA 4.0 document is to also include logical relationships to other areas of a digital platform, such as Cloud, High Performance Computing (HPC), IoT, Networks/5G, CyberSecurity, etc.


The BDV Reference Model may serve as a common reference framework to locate Big Data technologies on the overall IT stack. It addresses the main concerns and aspects to be considered for Big Data Value systems.

It is an aim of DataBench to include support for appropriate benchmarks covering all of the core areas of the BDV Reference Model, and to link to benchmarks for the related technical areas like Cloud, HPC, IoT, etc.

The BDV Reference Model is structured into horizontal and vertical concerns.

• Horizontal concerns cover specific aspects along the data processing chain, starting with data collection and ingestion and reaching up to data visualization. It should be noted that the horizontal concerns do not imply a layered architecture. As an example, data visualization may be applied directly to collected data (data management aspect) without the need for data processing and analytics. Further, data analytics might take place in the IoT area, i.e. Edge Analytics. These are logical areas, but they might execute in different physical layers.

• Vertical concerns address cross-cutting issues, which may affect all the horizontal concerns. In addition, verticals may also involve non-technical aspects (e.g., standardization, which has technical concerns but also non-technical ones).

Given the purpose of the BDV Reference Model to act as a reference framework to locate Big Data technologies, it is purposefully chosen to be as simple and easy to understand as possible. It thus does not have the ambition to serve as a full technical reference architecture. However, the BDV Reference Model is compatible with such reference architectures, most notably the emerging ISO JTC1 WG9 Big Data Reference Architecture, now being further developed in ISO JTC1 SC42 Artificial Intelligence. The evolution of these will be followed up on by DataBench.

The technical areas identified in the BDV Reference Model are:

Horizontal concerns:

• Big Data Applications: Solutions supporting big data within various domains will often consider the creation of domain-specific usages and possible extensions to the various horizontal and vertical areas. This is often related to the usage of various combinations of the identified big data types described in the vertical concerns.

• Data Visualization and User Interaction: Advanced visualization approaches for improved user experience.

• Data Analytics: Data analytics to improve data understanding, deep learning, and meaningfulness of data.

• Data Processing Architectures: Optimized and scalable architectures for analytics of both data-at-rest and data-in-motion, with low latency delivering real-time analytics.

• Data Protection: Privacy and anonymisation mechanisms to facilitate data protection. This area also has links to trust mechanisms like Blockchain technologies, smart contracts and various forms of encryption, and is associated with the area of CyberSecurity, Risk and Trust.

• Data Management: Principles and techniques for data management, including both data life cycle management and the usage of data lakes and data spaces, as well as underlying data storage services.

• Cloud and High-Performance Computing (HPC): Effective big data processing and data management might imply effective usage of Cloud and High-Performance Computing infrastructures. This area is separately elaborated further in collaboration with the Cloud and High-Performance Computing (ETP4HPC) communities.

• IoT, CPS, Edge and Fog Computing: A main source of big data is sensor data from an IoT context and actuator interaction in Cyber Physical Systems. In order to meet real-time needs, it will often be necessary to handle big data aspects at the edge of the system.

Vertical concerns:

• Big Data Types and semantics: The following six big data types have been identified, based on the fact that they often lead to the use of different techniques and mechanisms in the horizontal concerns, which should be considered, for instance, for data analytics and data storage: 1) Structured data; 2) Time series data; 3) Geospatial data; 4) Media, Image, Video and Audio data; 5) Text data, including Natural Language Processing data and Genomics representations; 6) Graph data, Network/Web data and Metadata. In addition, it is important to support both the syntactical and semantic aspects of data for all big data types.

• Standards: Standardisation of big data technology areas to facilitate data integration, sharing and interoperability.

• Communication and Connectivity: Effective communication and connectivity mechanisms are necessary for providing support for big data. This area is separately elaborated further with various communication communities, such as the 5G community.

• Cybersecurity: Big Data often needs support to maintain security and trust beyond privacy and anonymisation. The aspect of trust frequently has links to trust mechanisms such as Blockchain technologies, smart contracts and various forms of encryption. The CyberSecurity area is separately elaborated further with the CyberSecurity PPP community.

• Engineering and DevOps: for building Big Data Value systems. This area is also elaborated further with the NESSI (Networked European Software and Services Initiative) Software and Service community.

• Data Platforms: This is a new area being discussed as a grouping of data management and data processing, explicitly for Marketplaces, IDP/PDP, Ecosystems for Data Sharing and Innovation support. Data Platforms for Data Sharing include in particular Industrial Data Platforms (IDPs) and Personal Data Platforms (PDPs), but also other data sharing platforms like Research Data Platforms (RDPs) and Urban/City Data Platforms (UDPs). These platforms include efficient usage of a number of the horizontal and vertical big data areas, most notably the areas for data management, data processing, data protection and cybersecurity.

• AI platforms: In the context of the relationship between AI and Big Data, there is an evolving refinement of the BDV Reference Model, showing how AI platforms as a new area typically include support for Machine Learning, Analytics, visualization, processing, etc. in the upper technology areas supported by data platforms, for all of the various big data types.

With the experience from the initially selected benchmarks, there will be a continuous analysis of the inclusion of further appropriate benchmarks to ensure the coverage of the different areas.


4.3 Vision and potential implementation

A more detailed implementation of the overview specified in section 2.2 is explained in the following sections.

4.3.1 Phase 1: Alpha version

Figure 19 shows a potential implementation of the DataBench Toolbox.

Figure 19. DataBench Toolbox potential implementation

This potential implementation is based on Ansible6 for the backend. Ansible is an orchestration, configuration and deployment tool based on templates called playbooks, which simplify the process of deploying and configuring applications on different hosts.

The idea is to generate an Ansible playbook for downloading and configuring each of the benchmarks that will be integrated in the Framework, allowing them to be run by simply filling in the needed configuration of the cluster and running the playbook. To store these playbooks, we will have a playbook/configuration repository, which can be a git repository.
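As an illustration only, a per-benchmark playbook could look roughly like the following sketch; the host group, paths, variable names and template file are hypothetical placeholders, not the actual DataBench recipes:

```yaml
# Hypothetical playbook sketch for downloading and configuring a benchmark.
# Host group, install path and variables are illustrative placeholders.
- name: Download and configure the HiBench benchmark
  hosts: benchmark_cluster
  vars:
    hibench_version: master        # to be filled in by the Technical User
    install_dir: /opt/hibench
  tasks:
    - name: Fetch the benchmark sources
      git:
        repo: https://github.com/intel-hadoop/HiBench.git
        dest: "{{ install_dir }}"
        version: "{{ hibench_version }}"
    - name: Render the benchmark configuration from a template
      template:
        src: hibench.conf.j2
        dest: "{{ install_dir }}/conf/hibench.conf"
```

The user-facing configuration layer then reduces to filling in the `vars` section before the playbook is handed to the Ansible engine.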

6 https://www.ansible.com/

[Figure 19 detail, not reproduced as text: the Framework is split into an Ansible-based backend and a web frontend. The backend's Benchmark Orchestrator holds one Ansible configuration and one result parser per integrated benchmark (e.g. BigBench, HiBench, BigDataBench), and the results of the benchmark runs flow into a results store used to calculate the business KPIs.]


With this playbook repository in place, the project will have a web frontend where the user can choose the desired benchmarks to run. This information will be used to provide the requested playbooks to the user, so they can configure them according to their requirements and run them in their environment.

Figure 20. Benchmark orchestrator detail

As can be seen in Figure 20, the module in charge of the interaction between the repository and the frontend is the Benchmark Orchestrator. In detail, this module contains a custom-coded component that acts as an interconnection hub between the repository, the frontend, the Ansible engine that runs the playbooks, and the results DB. Given the set of benchmarks that the user has decided to run, the main tasks of the Ansible configuration selector module will be to:

1- Get the requested playbooks from the repository.
2- Provide the obtained playbooks to the user to fill in the needed configuration to run them on their infrastructure.
3- Store the selected configuration in the results DB, to be used later on when showing the results of the run.
4- Provide the configured playbooks to the Ansible engine and run them.
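The playbook-selection step of the orchestrator can be sketched as follows; the registry contents and playbook paths are hypothetical assumptions for illustration:

```python
# Illustrative sketch of the Ansible configuration selector: resolve the
# playbooks for the benchmarks chosen by the user. The registry below is
# a hypothetical stand-in for the playbook/configuration git repository.
PLAYBOOK_REPO = {
    "HiBench": "playbooks/hibench.yml",
    "BigBench": "playbooks/bigbench.yml",
    "YCSB": "playbooks/ycsb.yml",
}

def select_playbooks(benchmarks):
    """Return the playbook paths for the requested benchmarks,
    failing loudly if a benchmark has no registered playbook."""
    missing = [b for b in benchmarks if b not in PLAYBOOK_REPO]
    if missing:
        raise KeyError(f"No playbook registered for: {missing}")
    return [PLAYBOOK_REPO[b] for b in benchmarks]

print(select_playbooks(["HiBench", "YCSB"]))
```

The selected paths would then be handed to the user for configuration (step 2), recorded in the results DB (step 3), and passed to the Ansible engine for execution (step 4).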

[Figure 20 detail, not reproduced as text: the Ansible configuration selector reads recipes from the repositories (Ansible recipes, Docker images, configuration), saves the selected configuration in the results DB, and hands the configured playbooks (Docker, bare metal or cloud) to the Ansible engine, which can run a dummy local Docker cluster (Big Data Europe), create a cloud cluster if needed, or configure the connection to a physical cluster.]


The Ansible engine will then, given the correctly configured playbook, execute the selected benchmark on the configured infrastructure. To be able to do so, the only requirement is to have Ansible correctly installed on the machine that will act as master, and to configure it so that it can connect to the required hosts with enough privileges.

Each benchmark generates results in a different format and measures different parts of the system. In order to be able to use them in a homogeneous way, the project needs to gather and process those results in a centralized manner.

We propose Apache Kafka7 as the tool for the interconnection pipeline between the benchmarks, the result parsers and the results DB, as can be seen in Figure 21.

Apache Kafka is a distributed publish-subscribe messaging tool designed to be highly scalable, to have little overhead, to be very fast, and to offer persistence. It is based on topics and replicas, where any application can read from or write to any given topic without impacting the rest of the applications using Kafka.

Since each benchmark generates its results in a different format, we need some modules in charge of parsing those results into a standardized technical data model that can then be read and interpreted to generate the needed KPIs.

The only requirement for these parsers, as can be interpreted from the diagram in Figure 21, is that they should be able to use Kafka as a source and/or target, and relational databases as a target. Since Apache Kafka is widely used in the Big Data community, there are plenty of frameworks and libraries to use it with almost any programming language or tool, so the technology chosen to implement the result parsers is completely open.

As far as the results database is concerned, the expected volume is not large enough to justify highly scalable technologies, and the proposed technical data model is quite relational. We need to relate the benchmark information, the information of the run and the results obtained, which is a natural fit for any relational database. Moreover, we aim to connect the dashboard/web frontend directly to the results database, so we need a database technology that allows connectivity from a web frontend and provides the capabilities both to be open to the public and to be used internally to store the results.
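A minimal sketch of this relational model, using SQLite purely for illustration (the table and column names are assumptions, not the defined technical data model):

```python
# Hypothetical relational schema: benchmark -> run -> result,
# mirroring the three kinds of information described above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE benchmark (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL              -- e.g. 'HiBench', 'YCSB'
);
CREATE TABLE run (
    id INTEGER PRIMARY KEY,
    benchmark_id INTEGER NOT NULL REFERENCES benchmark(id),
    configuration TEXT,             -- serialized playbook configuration
    started_at TEXT
);
CREATE TABLE result (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES run(id),
    metric TEXT NOT NULL,           -- e.g. 'throughput'
    value REAL,
    unit TEXT
);
""")
conn.execute("INSERT INTO benchmark (id, name) VALUES (1, 'HiBench')")
conn.execute("INSERT INTO run (id, benchmark_id, configuration) VALUES (1, 1, '{}')")
conn.execute("INSERT INTO result (run_id, metric, value, unit) "
             "VALUES (1, 'throughput', 130678.0, 'bytes/s')")
row = conn.execute("""
    SELECT b.name, r.metric, r.value
    FROM result r
    JOIN run ru ON r.run_id = ru.id
    JOIN benchmark b ON ru.benchmark_id = b.id
""").fetchone()
print(row)
```

Storing the run configuration alongside the results is what allows the dashboard to show each result in the context of the infrastructure it was obtained on.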

7 https://kafka.apache.org/

Figure 21. Results processing [diagram not reproduced as text: the benchmarks write their results to an Apache Kafka topic within the DataBench Framework; the result parsers read them from the topic and store them in the results database, which in turn feeds the Business KPI calculator.]


4.3.2 Phase 2: Final version

The previous section details an initial implementation of the DataBench Framework, where most of the manual work is relegated to the Technical User that owns the system to be benchmarked. One of the main goals of the project is to provide an easy-to-use platform that needs as little interaction as possible from the users. To help achieve this objective, we propose an evolution of the initial idea where the main driver of the platform is the business KPIs that need to be obtained.

We propose to extend the previous phase by reusing most of the already created Ansible playbooks but, through the provided web frontend, making the platform intelligent enough to decide which recipes to combine and to populate some of the configuration values automatically.

To be able to achieve this objective, the Framework must know the relationship between business KPIs and the benchmarks, so that it is able to derive from the selected KPIs which benchmarks need to be run to calculate them.
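One simple way to represent this relationship is a KPI-to-benchmark mapping; the KPI names and the mapping below are hypothetical illustrations, since the actual derivation process is still under discussion:

```python
# Hypothetical mapping from business KPIs to the benchmarks needed to
# calculate them (KPI names and associations are illustrative only).
KPI_TO_BENCHMARKS = {
    "batch_analytics_cost": {"BigBench", "HiBench"},
    "streaming_latency": {"YSB"},
    "storage_throughput": {"YCSB"},
}

def benchmarks_for_kpis(kpis):
    """Derive the set of benchmarks to run for the selected business KPIs."""
    required = set()
    for kpi in kpis:
        required |= KPI_TO_BENCHMARKS.get(kpi, set())
    return required

print(sorted(benchmarks_for_kpis(["batch_analytics_cost", "storage_throughput"])))
```

Given such a mapping, the platform can take the KPIs chosen by the Business User, compute the union of the required benchmarks, and combine the corresponding Ansible recipes into a single one for the Technical User.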

In this case, the proposal includes a web frontend where business users can choose the business metrics that they expect; with that information, the platform will automatically combine all the needed Ansible recipes into a single one to be presented to the Technical Users, so they can just fill in the missing variables and run it on their infrastructure.

In this second phase of the development, the main actor will be the Business User, who knows what KPIs are needed, and the only task for the Technical User would be to fill in the system configuration and run the recipe. The Platform will be intelligent enough to combine all the needed recipes so that the benchmarks and their results can be processed to present the requested metrics to the Business User in the frontend. As explained in section 2.4, this business KPI derivation process is still under consideration and difficult to generalize. Therefore, in this second phase the final decision on the implementation of the derivation process will most probably involve not only algorithmic work to calculate business KPIs, but also visualization and Machine Learning techniques, in combination with other material provided by DataBench (i.e. the Handbook), to help the Business User take informed business decisions.


5. Conclusions

The DataBench Toolbox will be the central point of contact for the users of the DataBench Framework, and will allow the selection, deployment and execution of big data benchmarks, as well as the harmonization of the technical metrics and the derivation of business KPIs, where possible.

The aim of this document was therefore to present the initial architecture of the DataBench Toolbox and its associated technical components. In order to do so, we presented the different elements that form the DataBench ecosystem, putting the DataBench Toolbox in the context of the main functional elements to be delivered by the different Work Packages of the project. An overview of the main elements of the ecosystem, namely the DataBench Toolbox, the associated benchmarks, the KPI derivation module and the AI framework, has been presented, along with references to other supporting non-technical elements such as the DataBench Handbook.

After depicting the context of the project, the document dives into the architecture of the Toolbox. The DataBench Toolbox aims to be an umbrella framework for big data benchmarking, reusing existing efforts well settled in the community, with the focus therefore on not reinventing the wheel but rather reusing the best-of-breed benchmarks. Some of the expected features to reuse existing big data benchmarks and derive business insights have been presented by means of UML use case diagrams, as well as a potential architecture for the Toolbox. However, these diagrams should not be seen as written in stone. The current view may change somewhat in the future, but the main use cases will be respected and enhanced if needed. Ideas, phases and potential technologies to implement the Toolbox have also been explained and reported.

This initial architecture is the input to the first implementation of the DataBench Toolbox. An alpha version of the tool will be delivered in M18.


6. References

Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, Hans-Arno Jacobsen: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics. SIGMOD Conference 2013: 1197-1208

Ahmad Ghazal, Todor Ivanov, Pekka Kostamaa, Alain Crolotte, Ryan Voong, Mohammed Al-Kateb, Waleed Ghazal, Roberto V. Zicari: BigBench V2: The New and Improved BigBench. ICDE 2017: 1225-1236

BigBench, https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench, 2018

BDV SRIA, Big Data Value Association, European Big Data Value - Strategic Research and Innovation Agenda, vers. 4.0, Oct. 2017, http://www.bdva.eu/sria

Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears: Benchmarking Cloud Serving Systems with YCSB. Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC 2010, Indianapolis, Indiana, USA, June 10-11, 2010

Dakshi Agrawal, Ali Raza Butt, Kshitij Doshi, Josep-Lluís Larriba-Pey, Min Li, Frederick R. Reiss, Francois Raab, Berni Schiefer, Toyotaro Suzumura, Yinglong Xia: SparkBench - A Spark Performance Testing Suite. TPCTC 2015: 26-44

HiBench Suite, https://github.com/intel-hadoop/HiBench, 2018

Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In: Workshops Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long Beach, California, USA, pp. 41-51 (2010)

Sanket Chintapalli, Derek Dagit, Bobby Evans, Reza Farivar, Thomas Graves, Mark Holderbaugh, Zhuo Liu, Kyle Nusbaum, Kishorkumar Patil, Boyang Peng, Paul Poulosky: Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming. IPDPS Workshops 2016: 1789-1792

SparkBench, https://github.com/CODAIT/spark-bench, 2018

Todor Ivanov, Rekha Singhal: ABench: Big Data Architecture Stack Benchmark. Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, ICPE 2018, Berlin, Germany, April 09-13, 2018

TPCx-IoT, http://www.tpc.org/tpc_documents_current_versions/pdf/tpcx-iot_v1.0.3.pdf, 2018

YCSB, https://github.com/brianfrankcooper/YCSB, 2018

YSB, https://github.com/yahoo/streaming-benchmarks, 2018