
ON TRAFFIC-AWARE PARTITION AND AGGREGATION IN MAP

REDUCE FOR BIG DATA APPLICATIONS

S. Murali 1*

1 School of Computer Science and Engineering, VIT University, Vellore, Tamil Nadu, India.

[email protected]

Abstract

MapReduce is the backbone of Hadoop, providing the scalability needed to process and store huge volumes of data with ease. MapReduce is a procedure with two phases, a Map phase and a Reduce phase, and its input comes from the Hadoop Distributed File System (HDFS). Map tasks take the huge volume of input data and convert it into another form, a set of key-value pairs. A collection of Reduce tasks, run in parallel, then takes this new representation of the data as its input. Conventionally in MapReduce, a Name Node keeps records of how the data is segregated, the file structure, and how the chunks of data are placed; at boot-up, each Data Node sends its list of chunks to the Name Node. Whenever a data file changes, a new version number is assigned to it. Since the new version may hardly differ from the older one, this causes a large volume of replicated data.

Many systems have been proposed to improve the performance of Map and Reduce jobs, but they all fail to observe and reduce the network traffic. Here a method is proposed and simulated that reduces traffic by reducing the number of duplicated or replicated files in the cloud. It combines two major concepts. The first is file redirection: when a client requests a server and the requested server is busy, the request is redirected to another server. The second is replication control, realized with logical block addressing (LBA) and the MD5 (Message Digest 5) algorithm, which generates a unique hash code for the contents of each file and thus prevents files containing the same data from being replicated multiple times.

Keywords: Hadoop, data aggregation, MD5

* Corresponding author. E-mail address: [email protected]

Tel.: +91-9994715546

International Journal of Pure and Applied Mathematics, Volume 117, No. 7, 2017, 11-22. ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version). URL: http://www.ijpam.eu. Special Issue.


1. Introduction

Big data is a term used for data sets so large and complex that conventional data-processing software is inadequate for them. Such data is huge in size and grows exponentially over time, which makes storing and processing it with conventional data-processing and data-management tools inefficient. Hadoop was designed for exactly this kind of large-scale data processing and storage. Major examples of big data include:

One terabyte of fresh data is generated every day at the New York Stock Exchange (NYSE).

500+ TB of new data, including photos, videos, messages and comments, is generated daily on Facebook (social media).

Petabytes of data are generated by thousands of flights every day, with approximately 10+ TB of data generated in half an hour of flight time.

Though the volume of data is very large and handling or managing it is a challenging task, big data has various benefits associated with it. For example, organizations are able to improve their business strategies because they get access to data from Facebook, Twitter and other sources. Big data also helps to identify risks related to a product or service at an early stage and helps to improve operational efficiency. Managing and handling this data required a new design or framework: Apache Hadoop. The various data-processing and data-management applications are developed using this framework.

The framework is designed so that it can be scaled from a single server up to millions of servers. On a personal computer, the local file system is used for storing data; for big data, the same role is played by a distributed file system called HDFS, an acronym for Hadoop Distributed File System. Hadoop, an open-source project, is used to build applications that run on data sets distributed among clusters of commodity computers. The main components of Hadoop are HDFS and MapReduce. HDFS is the file-storage component of Hadoop, where data is stored in blocks and replicated multiple times so that data blocks are available on multiple computer nodes. The main purpose of replication is to ensure reliability and rapid accessibility for computations. HDFS has three types of nodes: the Data Node, the Name Node, and the Secondary Name Node. The Data Node stores data in the form of chunks in files; when the system is booted, the Data Node reports to the Name Node all the chunks of data it contains. The Name Node is like the central processing unit of HDFS: it keeps track of the locations and placement of file blocks or chunks on the various Data Nodes and is responsible for producing and distributing replicas when required. The Secondary Name Node acts as a standby for the Name Node in case the existing Name Node crashes.

MapReduce is the backbone of Hadoop, providing the scalability needed to process and store huge volumes of data. It is a procedure with two phases, a Map phase and a Reduce phase, and its input comes from the Hadoop Distributed File System (HDFS). Map tasks take the huge volume of input data and convert it into another form, a set of key-value pairs; a collection of Reduce tasks, run in parallel, then takes this new representation of the data as its input. MapReduce, and Hadoop, its open-source implementation, have been adopted by leading organizations, for instance Facebook, Yahoo! and Google, for various big-data applications such as digital security and machine learning. MapReduce divides a computation into two standard stages, Map and Reduce, which are carried out by a number of MapTasks and ReduceTasks, respectively.
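The two stages described above can be sketched in plain Python. This is a minimal, single-machine illustration of the programming model, not Hadoop code; the function names `map_phase`, `shuffle` and `reduce_phase` are our own labels for the three conceptual steps.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for every word in the input split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key before the reduce phase.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big cluster", "data node data"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
# counts == {'big': 2, 'data': 3, 'cluster': 1, 'node': 1}
```

In a real Hadoop job the map and reduce functions run on many machines in parallel, and the shuffle step is what generates the network traffic that this paper aims to reduce.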



2. Literature Survey

MapReduce is a programming model and implementation for processing and generating large data sets. Users specify a map function that processes a key-value pair to generate a collection of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same key [1]. According to Jeffrey Dean and Sanjay Ghemawat, the details of partitioning the input data, scheduling the program's execution, handling machine failures and managing inter-machine communication are taken care of by the run-time system [1]. Network bandwidth is one of the scarcest resources, and hence a locality optimization is used: writing a single copy of the data and reading data from local drives or disks to save network bandwidth [1]. Weina Wang, Lei Ying, Kai Zhu, Li Zhang and Jian Tan proposed an algorithm that combines a Max-Weight policy with a Join-the-Shortest-Queue policy, achieving the full capacity region while minimizing the expected number of backlogged tasks [2]. The algorithm was shown through simulations to be heavy-traffic optimal and throughput optimal [2]. MapReduce can be viewed as a three-phase algorithm, with Map, Shuffle and Reduce phases; Fangfei Chen et al. proposed and implemented a constant-factor approximation algorithm that minimizes the weighted response time across these phases [3]. In distributed computing, files are usually replicated and kept on multiple computers. This replication leads to large consumption of storage space, making it crucial to reclaim space where possible; it was observed that about half of the space consumed is taken up by duplicate or replicated files [4]. Douceur, Adya, Bolosky, Simon and Theimer modeled a framework to reclaim space from this duplication, consisting of convergent encryption and SALAD [4]. Convergent encryption allows replicated files to converge into a single file instance even though the files are encrypted under different users' keys. SALAD, an acronym for Self-Arranging, Lossy, Associative Database, is a database for aggregating file content and placement information in a fault-tolerant, scalable and decentralized manner [4]. As resource sharing and storage outsourcing services have come into demand, the problem of proving the integrity of data stored on untrusted servers has arisen. In the PDP (Provable Data Possession) model, the client preprocesses the data, keeps a small amount of metadata, and then sends the data to an untrusted server for storage; later, the client asks the server to prove that the stored data has not been deleted or tampered with [5].
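The key property of convergent encryption from reference [4] can be illustrated with a short sketch: the encryption key is derived from the file's content, so identical plaintexts always encrypt to identical ciphertexts and duplicates remain detectable. The XOR stream cipher below is a toy stand-in for illustration only; the actual scheme uses a real block cipher.

```python
import hashlib
from itertools import cycle

def convergent_key(content: bytes) -> bytes:
    # Key derived from the content itself: identical plaintexts
    # always yield identical keys, and hence identical ciphertexts.
    return hashlib.sha256(content).digest()

def xor_stream(data: bytes, key: bytes) -> bytes:
    # Toy XOR stream cipher, for illustration only.
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

plaintext = b"same file contents"
key = convergent_key(plaintext)
c1 = xor_stream(plaintext, key)
c2 = xor_stream(plaintext, convergent_key(plaintext))
# c1 == c2, so two users uploading the same file produce the same
# ciphertext; xor_stream(c1, key) recovers the original plaintext.
```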

3. Methodology

As described in the introduction, MapReduce divides a computation into Map and Reduce stages, which are carried out by a number of MapTasks and ReduceTasks, with input drawn from HDFS in the form of key-value pairs. In the proposed system, when a client makes a request, the process shown in figure 1 takes place in order. The system makes use of the MVC architecture. MVC is mainstream because it decouples the application logic from the UI layer and supports this separation. The Controller receives all requests for the application and then works with the Model to prepare any data required by the View; the View uses the data prepared by the Controller to produce the final response. The architecture is shown in figure 2. When a client wants to retrieve some



data, the client makes a request through a browser. The first step is that the user requests the data. The Controller handles communication between the Model and the View: it accepts the user request and selects the Model operation along with a suitable View for the response.

The Model is the component that maintains the data; it encapsulates the functionality, the content structure and the objects. Data from the Model is sent on to the View module, which prepares it for presentation and also handles update requests from the Model. The data that the client requested at the browser is finally returned to the client as HTML, a format whose properties make the data readable and easily understandable.
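The request flow just described can be sketched as a minimal MVC skeleton. The class and method names here are illustrative assumptions, not the system's actual code.

```python
class Model:
    """Maintains the data and encapsulates its content structure."""
    def __init__(self):
        self.files = {"report.txt": "block data"}

    def get(self, name):
        return self.files.get(name)

class View:
    """Renders the data prepared via the Controller as HTML."""
    def render(self, name, data):
        return f"<html><body>{name}: {data}</body></html>"

class Controller:
    """Receives the request, works with the Model, hands the
    prepared data to the View to produce the final response."""
    def __init__(self, model, view):
        self.model, self.view = model, view

    def handle(self, name):
        return self.view.render(name, self.model.get(name))

page = Controller(Model(), View()).handle("report.txt")
# page is the HTML response returned to the client's browser
```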

Fig. 1 Data Request and retrieval process

4. Proposed System Architecture

The system architecture for the proposed system is explained in this section and shown in figure 2. The client is the end user, who can upload or download a file. The file details are stored in the database, where the hash codes generated for the files are stored along with block numbers. A look-up function applied at the server checks whether the requested server is busy and, if so, redirects the request to another server. LBA (Logical Block Addressing) is used at the server itself; it is a scheme for keeping track of where the blocks of data formed from files are stored. Region 1 LBA and Region 2 LBA are two sub-regions differentiated by the IP addresses from which they receive requests. In the IP look-up, file names and the hash codes associated with their data are stored and matched for every new file upload or download request.
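The look-up-and-redirect behavior can be sketched as follows. This is a simplified model of the idea; the `busy` flag and the region names are assumed placeholders, not the system's real data structures.

```python
def look_up(servers, requested):
    # Serve locally when the requested server is free; otherwise
    # redirect the request to the first free server. Returns None
    # when every server is busy and the client must retry.
    if not servers[requested]["busy"]:
        return requested
    for name, state in servers.items():
        if not state["busy"]:
            return name
    return None

servers = {"region1": {"busy": True}, "region2": {"busy": False}}
target = look_up(servers, "region1")
# target == "region2": the request to the busy region1 is redirected
```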



Fig. 2 Proposed System Architecture

Key features illustrated in the architecture are:

LBA (Logical Block Addressing): a scheme for keeping track of where the blocks of data formed from files are stored.

Region 1 LBA and Region 2 LBA: two sub-regions differentiated by the IP addresses from which they receive requests.

Look-up function: in the IP look-up, file names and the hash codes associated with their data are stored and matched for every new file upload or download request.
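The replication-control idea, MD5 hash codes plus an instance count for duplicate content, can be sketched as follows. The `HashTable` class and its fields are illustrative assumptions, not the paper's actual implementation.

```python
import hashlib

class HashTable:
    """Maps MD5 digests of file contents to a block number and an
    instance count, so identical data is never stored twice."""
    def __init__(self):
        self.by_hash = {}    # md5 hex digest -> {"block": int, "instances": int}
        self.next_block = 0

    def upload(self, filename, content: bytes):
        digest = hashlib.md5(content).hexdigest()
        if digest in self.by_hash:
            # Same data under a different name: bump the instance
            # count instead of storing a duplicate block.
            self.by_hash[digest]["instances"] += 1
        else:
            self.by_hash[digest] = {"block": self.next_block, "instances": 1}
            self.next_block += 1
        return digest

table = HashTable()
table.upload("a.txt", b"hello")
table.upload("b.txt", b"hello")   # duplicate content, different name
table.upload("c.txt", b"world")
# only two blocks are stored; the "hello" entry has instance count 2
```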

The proposed framework has two modules, described briefly as follows.

Admin module: as the name suggests, the admin is the administrator who has all rights. The admin account is a super account with unique functionalities: it can view all clients' details, hash details (the hash codes generated for uploaded files) and cloud details, view the transactions of all clients, and add, edit or delete users.

User module: the client is the end user, who can upload files or request to download them. A client can view and edit his own details and view his transactions. Uploading a file leads to the generation of a hash code for the data in the file.

5. User Interface of the Proposed System



This section presents screenshots of the user interface created for the proposed methodology. The home page of the interface offers two login options: a person who is an admin selects the Admin option, while a general user logs in through the User option. When the admin clicks the User Details button, a screen appears listing all registered users along with the details of each. Figure 3 illustrates that when a new user has to be registered, the admin clicks the Add button, which leads to a screen where the new user's details are filled in; once the Register button is clicked, the new user is registered.

Fig. 3 User Registration

Figure 4 illustrates that when the admin clicks the Hash Details button, a screen appears displaying the hash codes generated for the contents of the files uploaded by users; the hash codes generated for the blocks of data in each file are shown on this screen.

Fig. 4 Details of Hash Table

In some cases, the data in two text files may be the same while the file names differ. In such cases the count in the instance column is increased and no duplicate hash code is generated, hence reducing storage cost.

When the admin selects a user name in the drop-down list, the screen displays all the transaction details of the selected user, as shown in figure 5. The transactions displayed include when that user logged in (with date and time), when that user uploaded or downloaded any file (with file name, date and time) and when the user logged out.



Fig. 5 User named "abc"

Figure 6 shows that when the user clicks the Show Profile button, details of the logged-in user such as ID no., name, user ID, mail ID, cell no. and address are displayed.

Fig. 6 User Profile

Figure 7 illustrates that when the user clicks Upload File or Download File, a screen appears listing the files uploaded by the user.



Fig. 7 File Details

When a user clicks the Upload button, a pop-up appears with an option to choose a file, letting the user select the text file to be uploaded, as shown in figure 8.

Fig. 8 File Upload

Fig. 9 Selection of file

International Journal of Pure and Applied Mathematics Special Issue

18

Page 9: ON TRAFFIC -AWARE PARTITION AND AGGREGATION ...MapReduce & Hadoop, its open -source utilization, have been received by driving organizations, for instance, Facebook, Yahoo! and Google,

Figure 9 shows that the user can browse through his file system to choose a file to be uploaded.

Fig. 10 File Uploading Process

The file gets uploaded and a pop-up message "File Uploaded Successfully" confirms the successful upload, as shown in figure 10. Figure 11 shows that if a user clicks the Delete button without selecting a file, a validation checks whether a file is selected for deletion; otherwise a pop-up with the message "Oops select at least one to delete" appears.

Fig. 11 File Deletion

Figure 12 shows that when a user clicks the Download File button, a screen appears listing the available files, from which the user can select one to download.



Fig. 12 File list available for Download

In this section we discussed how the interface for the admin and the user is created for the proposed methodology.

6. Conclusion

This paper concludes that network traffic can be reduced in both offline and online cases. The proposed methodology achieves this by eliminating redundant files in storage. The IP look-up concept redirects a file request to another server when the requested cloud region is busy, reducing network traffic. In addition, when the same file is saved multiple times on different servers, the replication control concept prevents replicated copies from being stored in the cloud, which is achieved through MD5 hash-code generation.

References

[1] Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.

[2] Wang, W., Zhu, K., Ying, L., Tan, J., & Zhang, L. (2016). MapTask scheduling in MapReduce with data locality: throughput and heavy-traffic optimality. IEEE/ACM Transactions on Networking, 24(1), 190-203.

[3] Chen, F., Kodialam, M., & Lakshman, T. V. (2012, March). Joint scheduling of processing and shuffle phases in MapReduce systems. In INFOCOM, 2012 Proceedings IEEE (pp. 1143-1151). IEEE.

[4] Douceur, J. R., Adya, A., Bolosky, W. J., Simon, D., & Theimer, M. (2002). Reclaiming space from duplicate files in a serverless distributed file system. In Distributed Computing Systems, 2002. Proceedings. 22nd International Conference on (pp. 617-624). IEEE.

[5] Erway, C. C., Küpçü, A., Papamanthou, C., & Tamassia, R. (2015). Dynamic provable data possession. ACM Transactions on Information and System Security (TISSEC), 17(4), 15.

[6] Diffie, W., & Hellman, M. E. (1976, November). New directions in cryptography. IEEE Transactions on Information Theory, 22(6).

[7] Garrett, P. (2001). Making, Breaking Codes: An Introduction to Cryptology. Upper Saddle River, NJ: Prentice-Hall.

[8] Kurose, J. F., & Ross, K. W. (2002). Computer Networking: A Top-Down Approach Featuring the Internet (2nd ed.). Addison Wesley.

[9] Lenstra, J., Rinnooy Kan, A., & Brucker, P. (1977). Complexity of machine scheduling problems. Annals of Discrete Mathematics, 1, 343-362.

[10] Lenstra, J., & Rinnooy Kan, A. (1978). Complexity of scheduling under precedence constraints. Operations Research, 22-35.

[11] Brucker, P. (2004). Scheduling Algorithms. Springer.

[12] Xie, Q., & Lu, Y. (2012). Degree-guided map-reduce task assignment with data locality constraint. In Proc. IEEE ISIT, Cambridge, MA, USA, pp. 985-989.

[13] Tan, J., Meng, X., & Zhang, L. (2013, April). Coupling task progress for MapReduce resource-aware scheduling. In Proc. IEEE INFOCOM, Turin, Italy, pp. 1618-1626.

[14] Isard, M., et al. (2009). Quincy: fair scheduling for distributed computing clusters. In Proc. ACM SOSP, Big Sky, MT, USA, pp. 261-276.

[15] Rajesh, M., & Gnanasekar, J. M. (2015). Congestion control in heterogeneous WANET using FRCC. Journal of Chemical and Pharmaceutical Sciences, ISSN 974: 2115.

[16] Rajesh, M., & Gnanasekar, J. M. (2014). A systematic review of congestion control in ad hoc network. International Journal of Engineering Inventions, 3(11), 52-56.

[17] Rajesh, M., & Gnanasekar, J. M. (2017). Annoyed realm outlook taxonomy using twin transfer learning. International Journal of Pure and Applied Mathematics, 116(21), 547-558.

[18] Rajesh, M., & Gnanasekar, J. M. (2017). Get-up-and-go efficient memetic algorithm based amalgam routing protocol. International Journal of Pure and Applied Mathematics, 116(21), 537-547.

[19] Rajesh, M., & Gnanasekar, J. M. (2017). Congestion control scheme for heterogeneous wireless ad hoc networks using self-adjust hybrid model. International Journal of Pure and Applied Mathematics, 116(21), 537-547.
