
Report on the project

Big Data Processing: Performance Gain Through In-Memory Computation

By Group 4: David Holland, Joy Rahman, Prosunjit Biswas, Rehana Begam, Yang Zhou

Introduction:

The main objective of this project is to analyze the performance gain in Big Data processing through in-memory computation. We studied the Hadoop MapReduce and Spark in-memory frameworks, gathered the execution times required for a benchmark to run on both of them, and analyzed those results to quantify the performance gain achieved by Spark.

Background and Motivation:

The rapid development of the Internet has generated vast amounts of data that pose big challenges to traditional data processing models. To deal with such challenges, a variety of cluster computing frameworks have been proposed to support large-scale data-intensive applications on commodity machines. MapReduce, introduced by Google, is one such successful framework for processing large data sets in a scalable, reliable and fault-tolerant manner.

Apache Hadoop provides an open source implementation of MapReduce. It is a very popular general-purpose framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware, and it is used for many different classes of data-intensive applications. To process the data, Hadoop MapReduce ships code to the nodes that hold the required data, and the nodes then process the data in parallel; this approach takes advantage of data locality. The term Hadoop often refers to the "Hadoop Ecosystem", meaning the combination of Hadoop and all of the additional software packages that can be installed on top of or alongside it, such as Pig, Hive, HBase, Spark and others.

Spark is an emerging framework, or compute engine, for Hadoop data. It provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. Spark is designed around a global cache mechanism and can achieve better response times because of its in-memory access across the distributed machines of the cluster.

Hadoop MapReduce is not well suited to iterative operations because of the cost of reloading data from disk at each iteration. MapReduce cannot keep reused data and state information across iterations, so it reads the same data repeatedly and materializes intermediate results on local disks in each iteration, requiring many disk accesses, much I/O, and unnecessary computation. Spark, on the other hand, offers better execution times by caching intermediate data in memory for iterative operations. Most machine learning algorithms run over the same data set iteratively, and in MapReduce there is no easy way to share state and data from one iteration to the next. Spark is designed to overcome these shortcomings of MapReduce for iterative operations: through a data structure called Resilient Distributed Datasets (RDDs), Spark can effectively improve the performance of iterative jobs with low latency requirements.
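To make this mechanism concrete, here is a minimal sketch in Scala of how an RDD that is reused across iterations can be pinned in memory with cache(); the file path, application name and the per-iteration work are illustrative assumptions, not our benchmark code.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD operations in Spark 1.x

    // Minimal sketch: keep a reused dataset in executor memory so that later
    // iterations do not re-read it from disk (path and app name are placeholders).
    object CacheSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cache-sketch"))
        val edges = sc.textFile("hdfs:///data/edges.txt")   // hypothetical input
          .map(_.split("\\s+"))
          .map(parts => (parts(0), parts(1)))
          .cache()                                           // pin in memory after first use

        // Every pass reuses the in-memory 'edges' RDD instead of re-reading HDFS,
        // which is exactly what a chain of MapReduce jobs cannot do.
        for (i <- 1 to 5) {
          val sources = edges.countByKey()                   // stand-in for real per-iteration work
          println(s"iteration $i: ${sources.size} distinct source nodes")
        }
        sc.stop()
      }
    }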

In this project, we attempted to conduct exhaustive experiments to evaluate the performance of Hadoop MapReduce and Spark. We considered execution time as the performance metric and chose a typical iterative algorithm, PageRank, to run over several real data sets on both frameworks.

Experimental Environment:

I. Cluster Architecture

The experimental cluster is composed of six computers. One of them is designated as the master, and the other five as slaves. We use Ubuntu 12.04.2 (GNU/Linux 3.5.0-28-generic x86_64) as the operating system on all of the computers.

Table 1 shows the hostname, machine model, IP address, CPU and memory information for each computer. We use Hadoop 1.2.1 and Spark 1.1.0 for all the experiments. Figure 1 shows the overall testbed architecture of our system.

II. Dataset Description

We chose four real graph datasets for the comparative experiments; Table 2 lists the graph datasets we used. They are all in edge-list format: each line in a file is a [src ID] [target ID] pair separated by whitespace. These real graph datasets come from SNAP [3].
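As a small illustration of this format, the following Scala helper (a sketch of ours, not part of the project code) parses one such line into a pair of node IDs and skips SNAP's '#'-prefixed comment lines:

    // Sketch: parse one line of the SNAP edge-list format, "[src ID] [target ID]"
    // separated by whitespace; comment lines starting with '#' yield None.
    def parseEdge(line: String): Option[(Long, Long)] =
      if (line.startsWith("#")) None
      else line.trim.split("\\s+") match {
        case Array(src, dst) => Some((src.toLong, dst.toLong))
        case _               => None
      }

    // Example: parseEdge("30 1412") == Some((30L, 1412L))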

Figure 1: Testbed Architecture

Implementation:

I. Benchmark: PageRank

PageRank is an algorithm used by Google Search to rank websites in their search engine results. PageRank was named after Larry Page, one of the founders of Google. It is a way of measuring the importance of website pages. It works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
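For reference, the standard per-page update that PageRank iterates (with damping factor d, usually 0.85) can be written as

    PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

where N is the total number of pages, M(p_i) is the set of pages that link to p_i, and L(p_j) is the number of outbound links of p_j. Many implementations, including the common Spark example, use (1 - d) in place of (1 - d)/N; this rescales the ranks but does not change their relative order.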

II. Execution Model for MapReduce

In Figure 2 we can see the steps for setting up HDFS for our input datasets. In the execution model for MapReduce, given the input files for the page ranking algorithm, we distribute the data over the Hadoop cluster and run three different MapReduce jobs on the data. Figure 3 gives a summary of the jobs (a generic sketch of the per-iteration logic follows Table 2 below).

Hostname         Machine      IP              CPU info   Memory
master           Hadoop-6     10.0.0.11       1          2 GB
slave0-slave4    Hadoop-2-5   10.0.0.2/4/5/9  1          2 GB

Table 1: Information of machines in the cluster

Name               File size   Nodes     Edges       Description
wiki-Vote          1.0 MB      7,115     103,689     Wikipedia who-votes-on-whom network
p2p-Gnutella31     10.8 MB     62,586    147,892     Gnutella peer-to-peer network from August 31, 2002
soc-Epinions1      5.531 MB    75,879    508,837     Who-trusts-whom network of Epinions.com
soc-Slashdot0811   10.496 MB   77,360    905,468     Slashdot social network from November 2008
web-Google         71.9 MB     875,713   5,105,039   Web graph from Google

Table 2: Graph Datasets
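As noted above, here is a hedged sketch of the per-iteration map and reduce logic that a Hadoop PageRank job typically implements, written as plain Scala functions rather than against the Hadoop API; our actual three jobs are only summarized in Figure 3, so all names below are illustrative.

    // Sketch of one PageRank iteration expressed as map and reduce functions.
    object MapReduceIterationSketch {
      // Map phase: re-emit the adjacency list (so it survives into the next
      // iteration) and send rank / |links| to every outgoing neighbour.
      def mapPhase(page: Long, links: Seq[Long], rank: Double)
          : Seq[(Long, Either[Seq[Long], Double])] =
        (page, Left(links): Either[Seq[Long], Double]) +:
          links.map(dst => (dst, Right(rank / links.size): Either[Seq[Long], Double]))

      // Reduce phase: recover the adjacency list, sum the incoming rank
      // contributions, and apply the damping factor d.
      def reducePhase(page: Long, values: Seq[Either[Seq[Long], Double]],
                      d: Double = 0.85): (Long, Seq[Long], Double) = {
        val links   = values.collectFirst { case Left(l) => l }.getOrElse(Seq.empty)
        val newRank = (1 - d) + d * values.collect { case Right(c) => c }.sum
        (page, links, newRank)
      }
    }

Because Hadoop writes the reduce output back to HDFS, running n iterations means reading and writing the whole (page, adjacency list, rank) table from and to disk n times, which is the per-iteration cost discussed in the Background section.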

III. Execution Model for Spark

In this case, we used the same dataset and HDFS configuration as for the MapReduce execution model. We rewrote the MapReduce jobs to take advantage of Spark's RDDs. A summary of this work is given in Figure 4.
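For comparison, the Spark version of the same loop fits in a few lines of Scala. The sketch below stays close to the well-known Spark PageRank example; the input path, output path, iteration count and damping constant are illustrative and not necessarily identical to the jobs summarized in Figure 4.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD operations in Spark 1.x

    object SparkPageRankSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pagerank-sketch"))

        // Build adjacency lists from the edge list and cache them, since the
        // same 'links' RDD is reused in every iteration.
        val links = sc.textFile("hdfs:///data/wiki-Vote.txt")   // hypothetical path
          .filter(!_.startsWith("#"))
          .map(_.split("\\s+")).map(p => (p(0), p(1)))
          .distinct().groupByKey().cache()

        var ranks = links.mapValues(_ => 1.0)

        for (_ <- 1 to 5) {
          // Each page splits its current rank evenly among its out-links ...
          val contribs = links.join(ranks).values.flatMap {
            case (neighbours, rank) => neighbours.map(dst => (dst, rank / neighbours.size))
          }
          // ... and each page sums what it received, with damping factor 0.85.
          ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
        }

        ranks.saveAsTextFile("hdfs:///out/pagerank")            // hypothetical output
        sc.stop()
      }
    }

The intermediate (page, rank) pairs stay in memory between iterations, which is where the running-time gains reported below come from.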

Figure 2: HDFS setup

Figure 3: Execution model for Hadoop

Figure 4: Execution model on Spark

Experimental Results:

For each dataset we run PageRank with different numbers of iterations on Hadoop and Spark to see how that affects the performance, and we record the total running time each dataset takes for each number of iterations. We stop at five iterations for all the datasets, rather than running to convergence, because five iterations are enough to quantify the time differences between Hadoop and Spark.

Figure 5 shows the running time required for each of the datasets when the iteration number = 1. As we can see, there is not much improvement with Spark when the iteration number is small.

In Figures 6 and 7, we show the results for PageRank on the same datasets when the iteration number = 2 and 3, respectively. As we can see, the running time for Hadoop keeps increasing and Spark outperforms Hadoop on all of the datasets.

The graph in Figure 8 shows the running time results for the benchmark when the iteration number = 5. From this figure we can see that Spark again performs better than the Hadoop MapReduce jobs.

We also tried to compare them on larger datasets, but found that although MapReduce can handle them, Spark cannot, because it does not get the memory it needs to run the benchmark on them.

Figure 5: Running time comparison, iter num=1

Figure 6: Running time comparison, iter num=2

Figure 7: Running time comparison, iter num=3

Figure 8: Running time comparison, iter num=5

While we were considering web-Google, a large dataset (71.9 MB) with 875,713 nodes and 5,105,039 edges, we found that the running time for Spark is much higher than that of MapReduce. Figure 9 shows the corresponding graph. The rise in Spark's running time may be caused by virtualization overhead.

Figure 9: Running time for web-Google dataset

Figure 10: Console output of a Spark run

Figure 11: Console output with memory error in Spark

The two figures above (Figures 10 and 11) show the typical console output for the Spark runs. Figure 10 shows the output of a successful run with the total running time required.

In Figure 11, we can see the "not enough space to cache partition rdd_x_y in memory" error for cit-Patents, a larger dataset (267.5 MB) with 3,774,768 nodes and 16,518,948 edges. This means that with our existing cluster configuration and its limited memory, Spark cannot run the PageRank benchmark on a bigger dataset.

Conclusion:

In this project, we worked with Hadoop MapReduce and Spark to compare the performance gain in terms of running time and memory consumption. We found that, although Spark performs better for small datasets, for large datasets MapReduce is much more efficient even with limited memory. Spark needs enough memory for the correct execution of the benchmark; without it, Spark can take longer and can even crash.

If speed is not a demanding requirement and we do not have abundant memory, we should not choose Spark. As long as we have enough disk space to accommodate the original dataset and the intermediate results, Hadoop MapReduce is a good choice.

References:

[1] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," in HotCloud, June 2010.

[2] Lei Gu and Huan Li, "Memory or Time: Performance Evaluation for Iterative Operation on Hadoop and Spark," in 2013 IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013.

[3] SNAP: http://snap.stanford.edu/data/

[4] hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/

[5] stackoverflow.com/questions/24167194/why-is-the-spark-task-running-on-a-single-node