
An Improved Hadoop Performance Model For Job Estimation And Resource Provisioning In Public Cloud

D.M. Kalai Selvi 1 and Dr. P. Ezhumalai 2

1 M.E. (CSE), R.M.D Engineering College, Kavaraipettai, Chennai-601206, Tamil Nadu
[email protected]

2 HOD of CSE Department, R.M.D Engineering College, Kavaraipettai, Chennai-601206, Tamil Nadu

January 4, 2018

Abstract

Hadoop is an open source implementation of the MapReduce framework which is widely used for the parallel processing of big data analytics. Hadoop users accept the resource plans and offers provided by cloud service providers to lease the resources needed for processing their Hadoop jobs and pay as per their use. However, there is no efficient resource provisioning mechanism that can complete a job within a specified deadline; it is the user's responsibility to manually adjust the resources required for processing their Hadoop jobs. We therefore propose an improved HP (Hadoop Performance) model which uses the Hive bucketing and partitioning concepts to reduce the long duration of the shuffle-sort and to complete the job efficiently within the desired deadline. Furthermore, efficient resource provisioning is obtained by using a feedback system and a user profile to determine the user's exact needs, and an optimal plan for their resource needs is suggested.

International Journal of Pure and Applied Mathematics, Volume 118, No. 16, 2018, 1369-1389. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). URL: http://www.ijpam.eu (Special Issue).

Key Words: Hadoop; MapReduce; Cloud Service Providers; Hive; Resource Provisioning; Bucket.

1 Introduction

Raw data is evolving rapidly from sources all around us, such as sensors, blogs, social networking sites, and aviation. This leads to the phenomenon called Big Data, which has moved IT solutions into a new dimension where not only computer science but also other fields such as electronics, civil engineering, and transport need IT solutions for their business; for example, in the U.S., police departments install traffic cameras that read the license plates of automobiles to identify stolen vehicles in real time. The Hadoop framework has the capability to handle both structured and unstructured data; among big data sets as a whole, about 80% of the data is unstructured and only 20% is structured [1]. Based on the challenges it poses, Big Data can be characterized by four dimensions:

VOLUME: To manage large volumes of data

VELOCITY: To manage rapidly arriving data

VARIETY: To manage both structured and unstructured data

VERACITY: To validate the correctness of large amounts of arriving data.

Cloud service providers such as Amazon EC2 enable users to configure the resources needed for their applications, but the current form of the EC2 cloud does not support Hadoop jobs with deadline constraints. So it is the sole responsibility of the user to assign the resources necessary to complete a job within a specified deadline, which is a highly challenging task. It is therefore necessary to focus on Hadoop performance modelling, which is itself a critical task.

A Hadoop job has multiple processing phases, with three core phases: the map phase, the shuffle phase, and the reduce phase. The first wave of the shuffle phase is processed in parallel with the map phase; this is called the overlapping stage. The remaining waves of the shuffle phase are processed after the map phase has completed; this is called the non-overlapping stage.

The existing Hadoop performance model does not have the capability to automate resource provisioning and complete a Hadoop job within the specified constraints. Furthermore, the number of reduce slots is strictly constant. We therefore propose an improved Hadoop performance model which automatically configures the resources required by the end user and completes the job within the deadline. The contents of this paper are organized as follows. Section 2 reviews the related literature, Section 3 presents the problem analysis, Section 4 describes the proposed work, and Sections 5 and 6 present the experimental results and conclusion.

2 Literature Review

[2] Starfish is a MADDER (MAD: Magnetism, Agility, Depth, extended with Data-lifecycle-awareness, Elasticity, and Robustness) self-tuning system for big data analytics. The main goal of Starfish is to enable Hadoop users and applications to get good performance automatically throughout the data lifecycle, without the user having to understand their needs manually and allocate resources by tuning various knobs. However, Starfish collects detailed information about a Hadoop job at a very fine granularity to automate job estimation, which is a burden.

[3] The general problem in forming a cluster is to determine the resources and configure them according to the MapReduce job in order to meet desired constraints such as cost, deadline, and execution time for a given big data analytics workload; this is referred to as the cluster sizing problem. In this paper, Elastisizer, a system added on top of the Hadoop stack, automates the allocation of resources to meet the desired requirements of massive data analytics; it provides reliable answers to cluster sizing queries in an automated fashion by using detailed Hadoop job profile information. However, Elastisizer needs this detailed job profile information, which increases the overhead and leads to high execution time.

[4] In this paper, the proposed Hadoop performance (HP) model considers both the overlapping and non-overlapping stages. It uses a scaling factor to automatically scale the resource allocation up or down within the specified timeframe. Although this model increases the possibility of automatic resource allocation, it relies on simple linear regression, and it is restricted to a constant number of reduce tasks.

[5][6] CRESP provides automatic estimation of job execution time and resource provisioning in an optimal manner. In general, multiple waves of the reduce phase give better performance than a single wave. In CRESP, however, the number of reduce tasks must equal the number of reduce slots, so only a single wave of the reduce phase is considered. Traditionally, a Hadoop cluster environment is assumed to be homogeneous. But there is now an explosion of heterogeneous Hadoop clusters, because many companies are adopting parallel processing for data-intensive applications. Due to the heterogeneity of resources in such clusters, Hadoop job completion time becomes inefficient and resource bottlenecks arise.

In [9] the authors proposed bound-based performance modelling of MapReduce job completion times in heterogeneous Hadoop clusters. However, they do not explore the properties of Hadoop job profiles, nor do they analyse the breakdown of Hadoop job execution on the different types of heterogeneous nodes in the cluster. In [10], the authors proposed ARIA (Automatic Resource Inference and Allocation), a framework for MapReduce environments that solves the problem of automatic resource allocation. It contains three inter-related components. First, a job profile summarizes the performance characteristics of the underlying application during the map and reduce stages. Second, a MapReduce performance model uses the job profile and its SLO (Service Level Objective) to estimate the amount of resources required to complete the job within the deadline. Finally, an SLO scheduler based on EDF (Earliest Deadline First) determines job ordering and allocates resources. The major disadvantage of this framework is that it does not consider node failures.

In [11] the authors proposed AROMA (Automated Resource Allocation and Configuration of MapReduce Environments). It was developed to automate resource provisioning in a heterogeneous cloud and to tune Hadoop configuration parameters so that performance goals are achieved while the incurred cost is minimized. The major disadvantages of this system are that its configuration can be ineffective and that it does not support multi-stage job workflows. [12] introduces the Parallax progress estimator, which estimates performance by predicting the remaining time of MapReduce pipelines based on the time already elapsed. The key strategy the authors use is to divide a MapReduce job into five key pipelines and estimate the remaining time of each from the time that has elapsed.

However, the stated approach does not study an explicit cost function that can be used in optimization problems. [13] also proposed optimal resource provisioning with minimal cost, but there is some prediction error.

In order to improve the efficiency and reliability of the Hadoop performance model, to provide a user-friendly automated resource provisioning scheme, and to complete jobs effectively within predefined constraints, an improved Hadoop performance model is proposed. The major contributions of this paper are:

• The improved Hadoop performance model covers all the core phases of the MapReduce framework, i.e., the map phase, shuffle phase, partitioner phase, combiner phase, and reduce phase.

• The improved HP model uses the Hive bucketing and partitioning concepts to reduce the long duration of the shuffle-sort and to complete the job efficiently within the desired deadline.

• Furthermore, efficient resource provisioning is obtained by using a feedback system and a user profile to determine the user's exact needs, and an optimal plan for their resource needs is suggested.

• The performance of the improved HP model is evaluated on the Amazon EC2 cloud, and the evaluation results show that the improved HP model outperforms both the HP and Starfish models.

3 Problem Analysis

3.1 Modeling Hadoop Job Phases

In Hadoop, any job that needs parallel processing and high performance enters the MapReduce framework. The submitted job is split into three core phases: the map, shuffle, and reduce phases. In the existing system, fine-granularity information about the job is collected [4], which is useful for completing the job within the deadline. Map tasks are executed in map slots and reduce tasks are executed in reduce slots; each slot runs one task at a time, and slots are assigned in terms of CPU and RAM. The map and reduce phases can each be executed in a single wave or in multiple waves [7].

3.1.1 Map Phase

The map phase accepts an input dataset from the Hadoop Distributed File System (HDFS), which should be in key-value pair format. Most datasets that are suited to parallel processing are supported by NoSQL databases, in which the data are stored as key-value, column-family, graph, or document databases. The map phase initially splits the input dataset into blocks (64 MB by default), where each block is treated as a map task and assigned to a map slot.

After the map tasks are processed, the intermediate key-value pair results are generated and stored in a buffer rather than on disk, in order to avoid the complication of replication. The total execution time of the map phase is

$$T_{m}^{total} = \frac{T_{m}^{avg} \times N_{m}}{N_{m}^{slot}} \qquad (1)$$

where $T_{m}^{avg}$ is the average execution time of a map task, $N_{m}$ is the number of map tasks, and $N_{m}^{slot}$ is the number of map slots.
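As a worked illustration of equation (1), with hypothetical figures rather than measurements from this paper: a 6.4 GB input split into 64 MB blocks gives $N_m = 100$ map tasks; with $N_m^{slot} = 20$ map slots and an average map task duration of $T_m^{avg} = 12$ s, equation (1) gives

$$T_m^{total} = \frac{12 \times 100}{20} = 60 \text{ s.}$$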

3.1.2 Shuffle Phase

The process of transferring the map output to the reduce phase in sorted order, so as to minimize the work of the reducers, is called the shuffle. In this phase, the Hadoop job fetches the intermediate map output and copies it to one or more reducers. If $N_r \le N_r^{slot}$, the shuffle phase completes in a single wave, and its total execution time is

$$T_{s}^{total} = \frac{T_{s}^{avg} \times N_{r}}{N_{r}^{slot}} \qquad (2)$$

where $T_{s}^{avg}$ is the average execution time of a shuffle task, $N_{r}$ is the number of reduce tasks, and $N_{r}^{slot}$ is the number of reduce slots.

Otherwise, the shuffle phase completes in multiple waves, and its total execution time is

$$T_{s}^{total} = \frac{(T_{w_1}^{avg} \times N_{w_1}) + \dots + (T_{w_n}^{avg} \times N_{w_n})}{N_{r}^{slot}} \qquad (3)$$

where $T_{w_i}^{avg}$ and $N_{w_i}$ are the average task duration and the number of tasks in wave $i$.
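As a worked illustration of equation (3), again with hypothetical figures: suppose $N_r = 40$ reduce tasks run on $N_r^{slot} = 20$ reduce slots, so the shuffle completes in two waves; the first (overlapping) wave of $N_{w_1} = 20$ tasks averages $T_{w_1}^{avg} = 30$ s and the second (non-overlapping) wave of $N_{w_2} = 20$ tasks averages $T_{w_2}^{avg} = 8$ s. Then

$$T_s^{total} = \frac{(30 \times 20) + (8 \times 20)}{20} = 38 \text{ s.}$$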

Fig. 1 describes the job execution flow of the existing Hadoop performance model.


Fig. 1. Hadoop job execution flow.

Table 1.1 defines the variables used in the above equations.


3.1.3 Reduce Phase

The reduce phase is a user-defined function in which the user can customize the processing logic. The customized reduce function processes the intermediate map output and produces the final output, which is usually stored in HDFS. The Hadoop performance model supports both overlapping and non-overlapping stages. The total execution time of the reduce phase is

$$T_{r}^{total} = \frac{T_{r}^{avg} \times N_{r}}{N_{r}^{slot}} \qquad (4)$$

where $T_{r}^{avg}$ is the average execution time of a reduce task.
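To show how equations (1)-(4) combine into a single job-time estimate, the following minimal Python sketch sums the three phase durations. The function name, parameters, and sample figures are our own illustrative assumptions, and the sketch ignores the overlap between the first shuffle wave and the map phase, so it is not the exact procedure of either the HP model or the improved HP model.

import math

def estimate_job_time(input_mb, block_mb, t_map_avg, n_map_slots,
                      n_reduce, n_reduce_slots, t_shuffle_avg, t_reduce_avg):
    # Equation (1): number of map tasks and total map phase time.
    n_map = math.ceil(input_mb / block_mb)
    t_map = t_map_avg * n_map / n_map_slots
    # Equations (2)/(3): with a single average per shuffle task, the
    # single-wave and multi-wave forms both reduce to this expression.
    t_shuffle = t_shuffle_avg * n_reduce / n_reduce_slots
    # Equation (4): total reduce phase time.
    t_reduce = t_reduce_avg * n_reduce / n_reduce_slots
    return t_map + t_shuffle + t_reduce

# Hypothetical example: 6.4 GB input, 64 MB blocks, 20 map and 20 reduce
# slots, 40 reduce tasks -> 60 s + 60 s + 50 s = 170 s.
print(estimate_job_time(6400, 64, 12.0, 20, 40, 20, 30.0, 25.0))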

4 Proposed Work

4.1 Improved Hadoop Performance Model

4.1.1 Job Estimation Model

In the improved Hadoop performance model, we handle both overlapping and non-overlapping stages. Generally, when processing in multiple waves, the first wave of the shuffle phase starts immediately after the first wave of the map phase completes, but it can finish only after all waves of the map phase have started; this creates a long execution time for the shuffle phase, which can be overcome using Hive.

When an existing data infrastructure based on relational databases is to be moved to Hadoop, this can be achieved using Hive, which provides HQL (Hive Query Language), a language similar to SQL (Structured Query Language), to fetch and process data from a structured data infrastructure [7][8].

The two most important concepts in Hive that are used to overcome the latency of the shuffle phase are bucketing and partitioning. When the input datasets are bucketed on a column using a hash function, along with partitioning, the overhead of large datasets is reduced; the bucketing and partitioning commands are issued in HQL, through the command line interface or a web user interface, based on the user's needs, as illustrated by the sketch below.
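To make the bucketing idea concrete, the following Python sketch hash-partitions records on a chosen column into a fixed number of buckets, which is conceptually what Hive does for a bucketed table. The function names, bucket count, and sample rows are hypothetical illustrations, not the HQL actually used in our experiments.

from collections import defaultdict

NUM_BUCKETS = 8  # hypothetical bucket count

def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Map the bucketing column to a bucket id; Hive uses a deterministic
    # hash for this, whereas Python's hash() is randomized per process.
    return hash(key) % num_buckets

def bucket_records(records, column_index):
    # Group rows by bucket id so that rows with the same key already sit
    # together before any shuffle takes place.
    buckets = defaultdict(list)
    for row in records:
        buckets[bucket_of(row[column_index])].append(row)
    return buckets

rows = [
    ("Bengali", "Contemporary celtic", "Anjan Dutt"),
    ("Kannada", "Carnival songs", "Anu Malik"),
]
print({b: len(rs) for b, rs in bucket_records(rows, 0).items()})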

Since the datasets are already bucketed, i.e. grouped and sorted on their columns, the shuffling work is reduced, which improves performance and further helps to complete the job within the deadline. The estimated execution time for a Hadoop job is finalized using the user profile, which collects detailed information about past Hadoop jobs such as the input data size, the map tasks, the reduce tasks, and their durations.

When a job processes an increasing size of input dataset, the number of map tasks increases proportionally, while the number of reduce tasks is specified by the user in the configuration file and can vary depending on the user's configuration. When the number of reduce tasks is kept constant, the execution durations of both the shuffle tasks and the reduce tasks increase linearly with the size of the input dataset, as assumed in the HP model. This is because the volume of an intermediate data block equals the total volume of generated intermediate data divided by the number of reduce tasks, so the volume of an intermediate data block also increases linearly with the input size. However, when the number of reduce tasks varies, the execution durations of both the shuffle tasks and the reduce tasks are no longer linear in the size of the input dataset.
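In symbols, writing $D_{int}$ for the total volume of generated intermediate data (a label introduced here only for convenience), the volume of an intermediate data block is

$$D_{block} = \frac{D_{int}}{N_r},$$

so for a fixed number of reduce tasks $N_r$ the block volume, and hence the shuffle and reduce task durations, grow linearly with the input size.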

In the improved HP model, we therefore consider a varied number of reduce slots based on the user profile, which holds detailed historical information about past input data sizes, their execution times, and the numbers of map slots and reduce slots used; this history is used to estimate the appropriate number of reduce slots, as sketched below.
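A minimal sketch of this idea, assuming the user profile is stored as a list of past runs; the record layout and the proportional scaling rule are our own illustration, not the exact estimation procedure of the improved HP model.

def suggest_reduce_slots(history, new_input_gb):
    # history: list of dicts such as
    #   {"input_gb": 10, "reduce_slots": 8, "job_time_s": 540}
    # Illustrative strategy: take the past run closest in input size and
    # scale its reduce-slot count proportionally to the new data size.
    closest = min(history, key=lambda run: abs(run["input_gb"] - new_input_gb))
    scale = new_input_gb / closest["input_gb"]
    return max(1, round(closest["reduce_slots"] * scale))

profile = [
    {"input_gb": 5,  "reduce_slots": 4,  "job_time_s": 300},
    {"input_gb": 20, "reduce_slots": 16, "job_time_s": 620},
]
print(suggest_reduce_slots(profile, 10))  # -> 8 with these hypothetical runs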

4.1.2 Resource Provisioning Model

Starfish and other Hadoop performance models [2][3][4][5] support manual or self-tuning of the resources needed for job execution. Resource provisioning plays a key role, because a job can be completed within a specified deadline, say t, only when sufficient resources for its processing have been provided.

The improved HP model supports automatic resource provisioning, without manipulating knob parameters, through a user feedback system. The feedback system collects fine-granularity information about the history of already processed jobs and assigns a weight or priority to each job based on the previous logs, so that highly critical tasks are provided resources immediately and automatically. Since even highly critical tasks are provided resources proactively, Hadoop jobs can be completed within the specified constraints.
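For illustration only, the sketch below picks the cheapest plan from a plans table (like the one shown in Section 5) whose capacity covers the estimated need, and a priority weight derived from the feedback and log history bumps critical jobs towards a larger plan. The plan fields and the weighting rule are assumptions, not the exact logic of the implemented system.

def pick_plan(plans, needed_rows, priority_weight=1.0):
    # plans: list of dicts such as
    #   {"id": 3, "memory_gb": 2, "cores": 2, "price_usd": 20, "size": 3000}
    # where "size" is the row capacity compared against the file row count,
    # as in the Hive "analysis" table of Section 5.
    target = needed_rows * priority_weight
    eligible = [p for p in plans if p["size"] >= target]
    if not eligible:
        return max(plans, key=lambda p: p["size"])  # fall back to the largest plan
    return min(eligible, key=lambda p: p["price_usd"])  # cheapest plan that fits

plans = [
    {"id": 3, "memory_gb": 2, "cores": 2, "price_usd": 20, "size": 3000},
    {"id": 4, "memory_gb": 4, "cores": 2, "price_usd": 40, "size": 5000},
]
print(pick_plan(plans, needed_rows=2500, priority_weight=1.5)["id"])  # -> 4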


5 Performance Evaluation

5.1 Experimental Setup

The performance of the improved Hadoop performance model was evaluated in two experimental setups, i.e., the Amazon EC2 cloud and the Google cloud. The experimental Hadoop cluster was set up on the Amazon EC2 cloud using 20 m1.large instances with their standard specifications. In this cluster we used Hadoop-1.2.1 and configured one instance as the NameNode and the other 19 instances as DataNodes; the NameNode was also used as a DataNode. The HDFS data block size was set to 64 MB and the replication level of data blocks was set to 3. Each instance was configured with one map slot and one reduce slot. We ran the WordCount application on the Hadoop cluster and employed Starfish to collect the job profiles. For each application running on each cluster we conducted 10 tests; each test was run 5 times and the average durations of the phases were taken.


Fig. 4. The performance of the WordCount application with a varied number of reduce tasks.

The following graph describes the performance of the improved HP model versus the HP model for job estimation time.


The following graph describes the estimated resource provisioning and compares the resource provisioning performance of the HP model versus the improved HP model.

The improved Hadoop performance model was also evaluated in the Google cloud. The project scenario on which it was evaluated is described here.

In the Google cloud, a file directory containing a list of files representing famous song details (the big data) was stored. A list of plans regarding the configuration of map and reduce slots was stored in a separate table in Hive. The contents of the file were stored in a Hive table in comma-separated format. The details of the user profile regarding the configuration of the required resources, together with the user feedback, were stored in a Hive database.

The file details were analysed and the big data was partitioned into buckets using a hash function on the major columns, and the list of optimal resource plans was displayed along with the optimal resource configuration. Using Compute Engine, a virtual machine was initialized and Hadoop 1.2.1 was installed on that VM along with Hive 1.2.x and Pig 0.16.0 for manipulating the big data. The file directory was transferred from the local file system to the Hadoop Distributed File System (HDFS).

Command

The file, stored in .CSV format, contains rows such as:

9,98993,3,Bengali,Contemporary celtic,3,Anjan Dutt,paid,album
9,105156,3,Bengali,Contemporary celtic,3,Anjan Dutt,paid,album
9,113452,3,Bengali,Contemporary celtic,3,Anjan Dutt,paid,album
9,119398,2,Bengali,Contemporary celtic,3,Anjan Dutt,free,movie
9,130766,3,Bengali,Contemporary celtic,2,Anjan Dutt,free,movie
9,133420,3,Kannada,Carnival songs,2,Anu Malik,free,movie
10,11091,4,Kannada,Carnival songs,2,Anu Malik,free,movie
10,19275,3,Kannada,Carnival songs,2,Anu Malik,free,movie
10,21140,4,Kannada,Carnival songs,2,Anu Malik,free,album
10,44487,3,Kannada,Carnival songs,1,Anu Malik,free,album
10,49147,1,Kannada,Carnival songs,1,Anu Malik,free,album
10,54923,5,Kannada,Contemporary celtic,1,Anu Malik,free,album
10,63451,3,Kannada,Contemporary celtic,1,Anu Malik,free,album
10,75206,3,Kannada,Contemporary celtic,1,Anu Malik,paid,movie
10,104641,1,Kannada,Contemporary celtic,5,Anu Malik,paid,movie
10,110054,3,Kannada,Contemporary celtic,5,Anu Malik,paid,movie

Command to load file.csv

LOGS = LOAD '/user/root/songs.csv';
LOGS_GROUP = GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);
STORE LOG_COUNT INTO '/user/hive/warehouse/resource.db/filedetails/songs.txt';

use resource;
drop table if exists filedetails;
CREATE EXTERNAL TABLE `filedetails` (`totalrow` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs://localhost:9000/user/hive/warehouse/resource.db/filedetails/songs.txt';

Command To Analyze the plan

set hive.cli.print.header=true;
use resource;
drop table if exists analysis;


create table analysis as
select a.id, a.memory, a.processor, a.ssd, a.transfer, a.price, a.size
from plans a JOIN filedetails b
where a.size > cast(b.totalrow as int);

select a.id, a.memory, a.processor, a.ssd, a.transfer, a.price, a.size, b.Feedback, b.Name, b.Email
from analysis a JOIN feedback b
where a.id = b.Planid;

Sample contents stored in the Hive table analysis (columns: id, memory, processor, ssd, transfer, price, size):

3  2 GB   2 Core   40 GB   3 TB  20$   3000
4  4 GB   2 Core   60 GB   4 TB  40$   5000
5  8 GB   4 Core   80 GB   5 TB  80$   7000
6  16 GB  8 Core   160 GB  6 TB  160$  8000
7  32 GB  12 Core  320 GB  7 TB  320$  9000
8  48 GB  16 Core  480 GB  8 TB  480$  10000
9  64 GB  20 Core  640 GB  9 TB  640$  11000

Feedback

NULL  NULL  Timestamp  Name  Email  Feedback
1  1  16-08-2015 15:00  vaishali.m  [email protected]  It would be even more useful and clear if we had the option of working it live.....and practical
2  2  16-08-2015 15:00  yukti sharma  [email protected]  it was good. need more time to elaborate. next time try for longer instances
3  3  16-08-2015 15:00  subhash  [email protected]  it is very nice and i have learnt many things about google design cloud during this session
4  4  16-08-2015 15:00  akhil vemuri  [email protected]  but expecting more
5  5  16-08-2015 15:01  Devang P Patel  [email protected]  the cloud was good...
6  6  16-08-2015 15:01  H.VIGNESH  [email protected]  awesome experience we had but time is not enough!!!
7  7  16-08-2015 15:01  vidya varsshini  [email protected]  very nice but can make it even more better
8  8  16-08-2015 15:01  subathra.r  [email protected]  not so effective


9  9  16-08-2015 15:01  sruthi r  [email protected]  nice..but little bit lag
10  1  16-08-2015 15:01  SUSHMITHA.U  [email protected]  have been more effective if it was more storage
11  2  16-08-2015 15:01  vijayashankar  [email protected]  to use this pack. Great eperience
12  3  16-08-2015 15:01  venkatraman.D  [email protected]  very nice cloud any very good one for improving our knowledge

Plans

id  memory  processor  ssd  transfer  price  size
1  512 MB  1 Core   20 GB   1 TB  5$    500
2  1 GB    1 Core   30 GB   2 TB  10$   2000
3  2 GB    2 Core   40 GB   3 TB  20$   3000
4  4 GB    2 Core   60 GB   4 TB  40$   5000
5  8 GB    4 Core   80 GB   5 TB  80$   7000
6  16 GB   8 Core   160 GB  6 TB  160$  8000
7  32 GB   12 Core  320 GB  7 TB  320$  9000
8  48 GB   16 Core  480 GB  8 TB  480$  10000
9  64 GB   20 Core  640 GB  9 TB  640$  11000

Output


6 Conclusion

Running a MapReduce Hadoop job on a public cloud such as Amazon EC2 necessitates a performance model to estimate the job execution time and, further, to provision a certain amount of resources for the job to complete within a given deadline. This paper has presented an improved HP model to achieve this goal, taking into account multiple waves of the shuffle phase of a Hadoop job. The experimental results showed that the improved HP model outperforms both Starfish and the HP model in job execution estimation and resource provisioning. One piece of future work would be to consider the dynamic overhead of the VMs involved in running the user jobs, in order to minimize resource over-provisioning.

References

[1] http://www.redbooks.ibm.com/

[2] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, Starfish: A Self-tuning System for Big Data Analytics, in CIDR, 2011, pp. 261-272.

[3] H. Herodotou, F. Dong, and S. Babu, No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics, in Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC '11), 2011, pp. 1-14.

[4] A. Verma, L. Cherkasova, and R. H. Campbell, Resource Provisioning Framework for MapReduce Jobs with Performance Goals, in Proceedings of the 12th ACM/IFIP/USENIX International Conference on Middleware, 2011, pp. 165-186.

[5] K. Chen, J. Powers, S. Guo, and F. Tian, CRESP: Towards Optimal Resource Provisioning for MapReduce Computing in Public Clouds, IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1403-1412, 2014.

[6] H. Herodotou, Hadoop Performance Models, 2011. [Online]. Available: http://www.cs.duke.edu/starfish/files/hadoop-models.pdf. [Accessed: 22-Oct-2013].

[7] T. White, Hadoop: The Definitive Guide, O'Reilly Media.


[8] E. Capriolo, D. Wampler, and J. Rutherglen, Programming Hive, O'Reilly, 2012.

[9] Z. Zhang, L. Cherkasova, and B. T. Loo, Performance Modeling of MapReduce Jobs in Heterogeneous Cloud Environments, in Proceedings of the 2013 IEEE Sixth International Conference on Cloud Computing, 2013.
