MapReduce for Machine Learning


1 Introduction

Frequency scaling on silicon—the ability to drive chips at ever higher clock rates—is beginning to hit a power limit as device geometries shrink, owing to leakage and to the fact that CMOS consumes power every time it changes state. Yet Moore’s law, the density of circuits doubling every generation, is projected to hold for another 10 to 20 years for silicon-based circuits. By doubling the number of processing cores on a chip, one can keep power consumption low while doubling the speed of many applications. This has forced an industry-wide shift to multicore.

We thus approach an era of increasing numbers of cores per chip, but there is as yet no good framework for machine learning to take advantage of massive numbers of cores. There are many parallel programming languages, such as Orca, Occam, ABCL, SNOW, MPI and PARLOG, but none of them makes it obvious how to parallelize a particular algorithm. There is a vast literature on distributed learning and data mining, but very little of it focuses on our goal: a general means of programming machine learning on multicore. Much of this literature belongs to a long and distinguished tradition of developing (often ingenious) ways to speed up or parallelize individual learning algorithms, for instance cascaded parallelization techniques, but such specialized implementations of popular algorithms rarely lead to widespread use. Some examples of more general work: Caragea et al. give general data distribution conditions for parallelizing machine learning, but restrict their focus to decision trees; Jin and Agrawal give a general machine learning programming approach, but only for shared-memory machines. This does not fit the architecture of cellular or grid-type multiprocessors, where each core has a local cache, even if it can be dynamically reallocated.

In this paper, we focus on developing a general and exact technique for parallel programming of a large class of machine learning algorithms on multicore processors. The central idea of this approach is to allow a future programmer or user to speed up machine learning applications by “throwing more cores” at the problem rather than searching for specialized optimizations. This paper’s contributions are: (i) we show that any algorithm fitting the Statistical Query Model may be written in a certain “summation form”; this form does not change the underlying algorithm, so it is not an approximation but an exact implementation; (ii) the summation form does not depend on, but can easily be expressed in, a map-reduce framework, which is easy to program in; (iii) this technique achieves essentially linear speed-up with the number of cores.
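
To make the summation form concrete, here is a minimal Python sketch (not taken from the report) of the summation form for ordinary least-squares regression: each map task computes partial sums of the normal-equation statistics for its chunk of data, and a single reduce adds them up. The chunking, function names and synthetic data are illustrative assumptions only.

import numpy as np

def map_chunk(X_chunk, y_chunk):
    # Map task: partial sums of the normal-equation statistics for one chunk.
    A = X_chunk.T @ X_chunk      # sum of x_i x_i^T over this chunk
    b = X_chunk.T @ y_chunk      # sum of x_i y_i over this chunk
    return A, b

def reduce_sums(partials):
    # Reduce task: add the partial sums produced by all map tasks.
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return A, b

# Simulate "throwing more cores at the problem" by splitting the data into chunks.
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=10000)

partials = [map_chunk(Xc, yc)
            for Xc, yc in zip(np.array_split(X, 4), np.array_split(y, 4))]
A, b = reduce_sums(partials)
theta = np.linalg.solve(A, b)    # identical to the serial computation, not an approximation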


2 Machine Learning

“Machine learning” sounds mysterious to most people; indeed, only a small fraction of professionals really know what it stands for. There is a good reason for this: the field is rather technical and difficult to explain to a layman. However, we would like to bridge this gap and explain a bit about what machine learning (ML) is and how it can be used in everyday life or business.

So what is this mysterious ML?

Machine learning can refer to:

• The branch of artificial intelligence;

• The methods used in this field (there are a variety of different approaches).

Speaking of the latter, Tom Mitchell, author of the well-known book “Machine Learning”, defines ML as “improving performance in some task with experience”. However, this definition is quite broad, so we can quote another, more specific description stating that ML deals with systems that can learn from data.

ML works with data and processes it to discover patterns that can later be used to analyze new data. ML usually relies on a specific representation of the data: a set of “features” that are understandable to a computer. For example, a text might be represented by the words it contains or by other characteristics such as the length of the text, the number of emotional words, and so on. This representation depends on the task you are dealing with and is typically referred to as “feature extraction”.
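
As a small illustration of the feature extraction described above (a sketch, not from the report; the word list and function name are made-up examples), a text can be turned into word counts plus a few global characteristics:

from collections import Counter

# Hypothetical word list, used only for illustration.
EMOTIONAL_WORDS = {"love", "hate", "great", "terrible", "happy", "sad"}

def extract_features(text: str) -> dict:
    # Turn raw text into features a learning algorithm can consume.
    words = text.lower().split()
    features = dict(Counter(words))              # bag-of-words counts
    features["_length"] = len(words)             # length of the text
    features["_emotional"] = sum(w in EMOTIONAL_WORDS for w in words)
    return features

print(extract_features("I love this great movie"))
# {'i': 1, 'love': 1, 'this': 1, 'great': 1, 'movie': 1, '_length': 5, '_emotional': 2}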

Types of ML

All ML tasks can be classified into several categories; the main ones are:

• Supervised ML;

• Unsupervised ML;

• Reinforcement learning.


Now let us explain in simple words the kind of problems that are dealt with by each category.

Supervised ML relies on data for which the true label/class is indicated. This is easiest to explain with an example. Imagine that we want to teach a computer to distinguish pictures of cats and dogs. We can ask some of our friends to send us pictures of cats and dogs with a tag ‘cat’ or ‘dog’ attached. Labeling is usually done by human annotators to ensure high-quality data. Now that we know the true labels of the pictures, we can use this data to “supervise” our algorithm in learning the right way to classify images. Once the algorithm has learned how to classify images, we can use it on new data and predict labels (‘cat’ or ‘dog’ in our case) for previously unseen images.

2.1 Applications to Machine Learning

Many standard machine learning algorithms follow one of a few canonical data processing patterns, which we outline below. A large subset of these can be phrased as MapReduce tasks, illuminating the benefits that the MapReduce framework offers to the machine learning community.

In this section, we investigate the performance trade-offs of using MapReduce from an algorithm-centric perspective, considering in turn three classes of ML algorithms and the issues of adapting each to a MapReduce framework. The resulting performance depends intimately on the design choices underlying the MapReduce implementation, and on how well those choices support the data processing pattern of the ML algorithm. We conclude this section with a discussion of changes and extensions to the Hadoop MapReduce implementation that would benefit the machine learning community.

2.1.1 A Taxonomy of Standard Machine Learning Algorithms

While ML algorithms can be classified along many dimensions, the one we take primary interest in here is procedural character: the data processing pattern of the algorithm. We consider single-pass, iterative and query-based learning techniques, along with several example algorithms and applications.


2.1.2 Single-pass Learning

Many ML applications make only one pass through a data set, extracting relevant statistics for later use during inference. This relatively simple learning setting arises often in natural language processing, from machine translation to information extraction to spam filtering. These applications often fit perfectly into the MapReduce abstraction, encapsulating the extraction of local contributions in the map task and then combining those contributions to compute relevant statistics about the dataset as a whole. Consider the following examples, illustrating common decompositions of these statistics.

Estimating Language Model Multinomials: Extracting language models from a large corpus amounts to little more than counting n-grams, though some parameter smoothing over the statistics is also common. The map phase enumerates the n-grams in each training instance (typically a sentence or paragraph), and the reduce function counts the instances of each n-gram. (This option was investigated as part of Alex Rasmussen’s Hadoop-related CS 262 project this semester.)
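
A minimal sketch of this decomposition (illustrative only, not code from the report; the function names and toy corpus are assumptions), simulating the map and reduce steps for bigram counting in plain Python:

from collections import Counter
from itertools import chain

def map_ngrams(sentence: str, n: int = 2):
    # Map task: emit (n-gram, 1) pairs for one training instance.
    words = sentence.split()
    return [(tuple(words[i:i + n]), 1) for i in range(len(words) - n + 1)]

def reduce_counts(pairs):
    # Reduce task: sum the counts emitted for each n-gram.
    counts = Counter()
    for ngram, c in pairs:
        counts[ngram] += c
    return counts

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = reduce_counts(chain.from_iterable(map_ngrams(s) for s in corpus))
print(counts[("sat", "on")])   # 2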

Feature Extraction for Naive Bayes Classifiers: Estimating the parameters of a naive Bayes classifier, or of any fully observed Bayes net, again requires counting occurrences in the training data. In this case, however, feature extraction is often computation-intensive, perhaps involving a small search or optimization problem for each datum. The reduce task, however, remains a summation over counts of each (feature, label) pair.
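
The same counting pattern can be sketched for naive Bayes parameter estimation; the snippet below is an illustrative simulation (the data and function names are made up), where each map call emits (feature, label) pairs and the reduce sums them into the counts used to estimate P(feature | label):

from collections import Counter

def map_datum(features, label):
    # Map task: emit ((feature, label), 1) for every feature in the datum.
    return [((f, label), 1) for f in features]

def reduce_feature_counts(pairs):
    # Reduce task: sum the per-datum contributions into global counts.
    counts = Counter()
    for key, c in pairs:
        counts[key] += c
    return counts

data = [({"free", "money"}, "spam"), ({"meeting", "today"}, "ham"),
        ({"free", "tickets"}, "spam")]
pairs = [p for x, y in data for p in map_datum(x, y)]
counts = reduce_feature_counts(pairs)
print(counts[("free", "spam")])   # 2 -- used to estimate P(feature | label)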

Syntactic Translation Modeling: Generating a syntactic model for machine translation is an example of a research-level machine learning application that involves only a single pass through a preprocessed training set. Each training datum consists of a pair of sentences in two languages, an estimated alignment between the words in each, and an estimated syntactic parse tree for one of the sentences. (Generating appropriate training data for this task itself involves several applications of iterative learning algorithms, described in the following section.) The per-datum feature extraction encapsulated in the map phase for this task involves search over these coupled data structures.

2.1.3 Iterative Learning

The class of iterative ML algorithms – perhaps the most common within the machine learning research community – can also be expressed within the framework of MapReduce by chaining together multiple MapReduce tasks. While such algorithms vary widely in the type of operation they perform on each datum (or pair of data) in a training set, they share the common characteristic that a set of parameters is matched to the data set via iterative improvement. The update to these parameters across iterations must again decompose into per-datum contributions, which is the case for the example applications below. As with the examples discussed in the previous section, the reduce function is considerably less compute-intensive than the map tasks.

In the examples below, the contribution to the parameter updates from each datum (the map function) depends in a meaningful way on the output of the previous iteration. For example, the expectation computation of EM, or the inference computation in an SVM or perceptron classifier, can reference a large portion or all of the parameters generated by the algorithm. Hence, these parameters must remain available to the map tasks in a distributed environment. The information necessary to compute the map step of each algorithm is described below; the complications that arise because this information is vital to the computation are investigated later in the paper.

Expectation Maximization (EM): The well-known EM algorithm maximizes the likelihood of a training set given a generative model with latent variables. The E-step of the algorithm computes posterior distributions over the latent variables given the current model parameters and the observed data. The maximization step adjusts the model parameters to maximize the likelihood of the data assuming that the latent variables take on their expected values. Projected onto the MapReduce framework, the map task computes posterior distributions over the latent variables of a datum using the current model parameters; the maximization step is performed as a single reduction, which sums the sufficient statistics and normalizes them to produce updated parameters.
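
As an illustrative sketch of this chaining (not code from the report; a one-dimensional two-component Gaussian mixture is assumed purely for brevity), each iteration below plays the role of one MapReduce job: the map computes per-datum sufficient statistics from the broadcast parameters, and the reduce sums and normalizes them.

import numpy as np

def e_step_map(x, params):
    # Map task: per-datum sufficient statistics given the current parameters.
    pi, mu, sigma = params["pi"], params["mu"], params["sigma"]
    lik = pi * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / sigma   # unnormalized likelihoods
    r = lik / lik.sum()                                         # responsibilities
    return r, r * x, r * x * x

def m_step_reduce(stats):
    # Reduce task: sum sufficient statistics and normalize into new parameters.
    r, rx, rxx = (np.sum([s[i] for s in stats], axis=0) for i in range(3))
    mu = rx / r
    sigma = np.sqrt(np.maximum(rxx / r - mu ** 2, 1e-6))
    return {"pi": r / r.sum(), "mu": mu, "sigma": sigma}

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])
params = {"pi": np.array([0.5, 0.5]), "mu": np.array([-1.0, 1.0]),
          "sigma": np.array([1.0, 1.0])}

for _ in range(20):                    # each iteration = one chained MapReduce job
    stats = [e_step_map(x, params) for x in data]   # map (parameters broadcast)
    params = m_step_reduce(stats)                   # reduce
print(params["mu"])                    # approximately [-2, 3]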

We consider applications for machine translation and speech recognition. For multivariate Gaussian mixture models (e.g., for speaker identification), these parameters are simply the mean vector and a covariance matrix. For HMM-GMM models (e.g., speech recognition), parameters are also needed to specify the state transition probabilities; the models, efficiently stored in binary form, occupy tens of megabytes. For word alignment models (e.g., machine translation), these parameters include word-to-word translation probabilities; these can number in the millions, even after pruning heuristics remove unnecessary parameters.

Discriminative Classification and Regression: When fitting model parameters via a perceptron, boosting, or support vector machine algorithm for classification or regression, the map stage of training involves computing inference over a training example given the current model parameters. As in the EM case, a subset of the parameters from the previous iteration must be available for inference; the reduce stage, however, typically involves summing over parameter changes.

Thus, all relevant model parameters must be broadcast to each map task. In a typical feature-based setting that extracts hundreds or thousands of features from each training example, the relevant parameter space needed for inference can be quite large.
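
A minimal sketch of this pattern for logistic regression (an assumption for illustration; the report does not fix a particular model or these function names): the current weights are broadcast to every map task, each task computes the parameter changes for its chunk, and the reduce step sums them.

import numpy as np

def map_gradient(X_chunk, y_chunk, w):
    # Map task: parameter changes for one data chunk, given the broadcast weights w.
    p = 1.0 / (1.0 + np.exp(-X_chunk @ w))      # current predictions
    return X_chunk.T @ (y_chunk - p)            # per-chunk gradient contribution

def reduce_updates(partials):
    # Reduce task: sum the parameter changes proposed by all map tasks.
    return np.sum(partials, axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(4000, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)
w = np.zeros(10)

for _ in range(50):                                        # one MapReduce job per iteration
    partials = [map_gradient(Xc, yc, w)                    # w is broadcast to every map task
                for Xc, yc in zip(np.array_split(X, 4), np.array_split(y, 4))]
    w += 0.001 * reduce_updates(partials)                  # apply the summed update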

2.1.4 Query-based Learning with Distance Metrics

Finally, we consider distance-based ML applications that directly reference the training set during inference, such as the nearest-neighbor classifier. In this setting, the training data are the parameters, and a query instance must be compared to each training datum.

While multiple query instances can be processed simultaneously within a MapReduce implementation of these techniques, the query set must be broadcast to all map tasks. Again, we have a need for the distribution of state information. However, in this case the query information that must be distributed to all map tasks needn’t be processed concurrently – a query set can be broken up and processed over multiple MapReduce operations. In the examples below, each query instance tends to be of a manageable size.

K-nearest Neighbors Classifier: The nearest-neighbor classifier compares each element of a query set to each element of a training set, and discovers the examples with minimal distance from the queries. The map stage computes distance metrics, while the reduce stage tracks the k examples for each label that have minimal distance to the query.
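
A simplified sketch of this scheme (not from the report; it keeps the k nearest neighbours overall and takes a majority vote, and the data, names and cluster layout are invented for illustration):

import heapq
import numpy as np

def map_distances(train_chunk, labels_chunk, query):
    # Map task: distance of the query to every training point in this chunk.
    d = np.linalg.norm(train_chunk - query, axis=1)
    return list(zip(d, labels_chunk))

def reduce_knn(pairs, k=3):
    # Reduce task: keep the k training examples closest to the query, then vote.
    nearest = heapq.nsmallest(k, pairs, key=lambda p: p[0])
    labels = [lbl for _, lbl in nearest]
    return max(set(labels), key=labels.count)

rng = np.random.default_rng(2)
train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
labels = ["a"] * 50 + ["b"] * 50
query = np.array([3.5, 4.2])

pairs = []
for tc, lc in zip(np.array_split(train, 4), np.array_split(np.array(labels), 4)):
    pairs += map_distances(tc, lc, query)          # the query is broadcast to each map task
print(reduce_knn(pairs))                           # expected: 'b' (query lies near the second cluster)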


Similarity-based Search: Finding the most similar instances to a given query has a similar character, sifting through the training set to find examples that minimize a distance metric. Computing the distance is the map stage, while minimizing it is the reduce stage.

2.2 Performance and Implementation Issues

While the algorithms discussed above can all be implemented in parallel using the MapReduce abstraction, our example applications from each category revealed a set of implementation challenges.

We conducted all of our experiments on top of the Hadoop platform. In the discussion below, we will address issues related both to the Hadoop implementation of MapReduce and to the MapReduce framework itself.

2.2.1 Single-pass Learning

The single-pass learning algorithms described in the previous section are clearly amenable to the MapReduce framework. We will focus here on the task of generating a syntactic translation model from a set of sentence pairs, their word-level bilingual alignments, and their syntactic structures.

[Figure omitted: running time in seconds versus training size in thousands of sentence pairs, for a local reference implementation, a local (single-machine) MapReduce run, and a 3-node MapReduce cluster.]

Fig1: The benefit of distributed computation quickly outweighs the overhead of a MapReduce implementation on a 3-node cluster.

Figure 1 shows the running times for various input sizes, demonstrating the overhead of running MapReduce relative to the reference implementation. The cost of running Hadoop cancels out some of the benefit of parallelizing the code. Specifically, running on 3 machines gave a speed-up of 39% over the reference implementation. The overhead of simulating a MapReduce computation on a single machine was 51% of the compute cost of the reference implementation. Distributing the task to a large cluster would clearly justify this overhead, but parallelizing to two machines would give virtually no benefit for the largest data set size we tested.

A more promising metric shows that as the size of the data scales, the distributed MapReduce implementation maintains a low per-datum cost. We can isolate the variable-cost overhead of each example by comparing the slopes of the curves in Figure 1, which are all near-linear. The reference implementation shows a variable computation cost of 1.7 seconds per 1000 examples, while the distributed implementation across three machines shows a cost of 0.5 seconds per 1000 examples. So the variable overhead of MapReduce is minimal for this task, while the static overhead of distributing the code base and channeling the processing through Hadoop’s infrastructure is large. We would expect that substantially increasing the size of the training set would accentuate the utility of distributing the computation.

Thus far, we have assumed that the training data was already written to Hadoop’s distributed file system (DFS). The cost of distributing data is relevant in this setting, however, due to drawbacks of Hadoop’s implementation of the DFS. In the simple case of text processing, a training corpus need only be distributed to Hadoop’s file system once, and can then be processed by many different applications.

On the other hand, this application references four different input sources, including sentences, alignments and parse trees for each example. When copying these resources independently to the DFS, the Hadoop implementation gives no control over how those files are mapped to remote machines. Hence, no one machine necessarily contains all of the data necessary for a given example.

Apache Mahout is a framework for implementing machine learning in the MapReduce paradigm.

3 Apache Mahout

Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on the areas of collaborative filtering, clustering and classification. Many of the implementations use the Apache Hadoop platform. Mahout also provides Java libraries for common maths operations (focused on linear algebra and statistics) and primitive Java collections. Mahout is a work in progress; the number of implemented algorithms has grown quickly, but various algorithms are still missing.

While Mahout's core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm, the project does not restrict contributions to Hadoop-based implementations. Contributions that run on a single node or on a non-Hadoop cluster are also welcomed. For example, the 'Taste' collaborative-filtering recommender component of Mahout was originally a separate project and can run stand-alone without Hadoop. Integration with initiatives such as Pregel-like graph frameworks is actively under discussion.

Mahout's algorithms include many new implementations built for speed on Mahout-Samsara. They run on Spark, and some on H2O, which can mean as much as a 10x speed increase. You’ll find robust matrix decomposition algorithms as well as a Naive Bayes classifier and collaborative filtering. The new spark-itemsimilarity job enables the next generation of co-occurrence recommenders, which can use entire user click streams and context in making recommendations.

3.1 Mahout Installation

Mahout is scalable in several senses. Scalable to reasonably large data sets: the core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm; however, contributions are not restricted to Hadoop-based implementations, and contributions that run on a single node or on a non-Hadoop cluster are welcome as well, with the core libraries highly optimized to give good performance for non-distributed algorithms too. Scalable to support your business case: Mahout is distributed under a commercially friendly Apache Software license. Scalable community: the goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases (come to the mailing lists to find out more). Currently Mahout supports mainly four use cases:

• Recommendation mining takes users' behavior and from that tries to find items users might like.

• Clustering takes e.g. text documents and groups them into groups of topically related documents.

• Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabeled documents to the (hopefully) correct category.

• Frequent item set mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.

3.1.1 Apache Mahout Integration

The flexibility and power of Apache Mahout (http://mahout.apache.org/) in conjunction with Hadoop is invaluable. Therefore, I have packaged the most recent stable release of Mahout (0.5), and am very excited to work with the Mahout community and to become much more involved with the project as both Mahout and Hadoop continue to grow.

3.1.2 Why We Are Packaging Mahout with Hadoop

Machine learning is an entire field devoted to information retrieval, statistics, linear algebra, analysis of algorithms, and many other subjects. This field allows us to examine things such as recommendation engines involving new friends, love interests, and new products. We can do incredibly advanced analysis around genetic sequencing and examination, distributed search and frequent-pattern matching, as well as mathematical analysis with vectors, matrices, and singular value decomposition (SVD). Apache Mahout is an open source project, part of the Apache Software Foundation, devoted to machine learning. Mahout can operate on top of Hadoop, which allows the user to apply machine learning, via a selection of algorithms in Mahout, to distributed computing via Hadoop. Mahout packages popular machine learning algorithms such as:

• Recommendation mining takes users’ behavior and finds items the specified user might like.

• Clustering takes e.g. text documents and groups them based on related document topics.

• Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabeled documents to the appropriate category.

• Frequent item set mining takes a set of item groups (e.g. terms in a query session, shopping cart content) and identifies which individual items typically appear together.

We are very excited to be working with the Apache Mahout community and highly encourage everyone who is currently using CDH to give Mahout a try! As always, we are open to any guests who would like to blog about their experience using Mahout with CDH.

3.2 Installing Mahout

Mahout is a collection of highly scalable machine learning algorithms for very large data sets. Although the real power of Mahout shows only on large HDFS data, Mahout also supports running algorithms on local file system data, which can help you get a feel for how to run Mahout algorithms. Before you can run any Mahout algorithm you need a Mahout installation ready on your Linux machine, which can be set up as described in the steps below.

Step 1:

Download mahout-distribution-0.x.tar.gz from the Apache Download Mirrors and extract the contents of the Mahout package to a location of your choice; here we use /usr/local. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:

$ cd /usr/local
$ sudo tar xzf mahout-distribution-0.x.tar.gz
$ sudo mv mahout-distribution-0.x mahout
$ sudo chown -R hduser:hadoop mahout

This should result in a folder named /usr/local/mahout.

Now, you can run any of the algorithms using the script “bin/mahout” present in the extracted folder. For testing your installation, you can also run bin/mahout without any other arguments.

Now we will set the path in the .bashrc file:

export MAHOUT_HOME=/usr/local/mahout
export PATH=$PATH:$MAHOUT_HOME/bin

Step 2:

Create a directory where you want to check out the Mahout code; we’ll call it MAHOUT_HOME here:

$ sudo mkdir -p /app/mahout
$ sudo chown hduser:hadoop /app/mahout
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/mahout

Step 3:

Now we add the Mahout libraries to the Hadoop classpath in hadoop-env.sh, for example:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/local/mahout/lib/*

Step 4:

Now we install Maven:

$ sudo tar xzf apache-maven-2.0.9-bin.tar.gz
$ sudo mv apache-maven-2.0.9 maven
$ sudo chown -R hduser:hadoop maven

Now we will set the path in the .bashrc file:

export M2_HOME=/usr/local/maven
export PATH=$PATH:$M2_HOME/bin

Now we can start Mahout with the bin/mahout command; the Mahout installation is complete and can be used together with the Hadoop configuration.


Fig2: Showing the installed Maven version.

4 Advantages and Disadvantages

Advantages

The paper “Map-Reduce for Machine Learning on Multicore” shows 10 machine learning algorithms that can benefit from the map-reduce model. The key point is that “any algorithm fitting the Statistical Query Model may be written in a certain summation form”, and any algorithm that can be expressed in summation form can use the map-reduce programming model.

Disadvantages

MapReduce does not work when there are computational dependencies in the data. This limitation makes it difficult to represent algorithms that operate on structured models.

As a consequence, when confronted with large-scale problems, we often abandon rich structured models in favor of overly simplistic methods that are amenable to the MapReduce abstraction.

In the machine-learning community, numerous algorithms iteratively transform parameters during both learning and inference, e.g., belief propagation, expectation maximization, gradient descent and Gibbs sampling. These algorithms iteratively refine a set of parameters until some termination criterion is met.

If you invoke MapReduce in each iteration, you can still speed up the computation. The point here is that we want a better abstraction framework, so that it is possible to embrace the graphical structure of the data, to express sophisticated scheduling, or to assess termination automatically.


5 Conclusions

By virtue of its simplicity and fault tolerance, MapReduce proved to be an admirable gateway to parallelizing machine learning applications. The benefits of easy development and robust computation did come at a price in terms of performance, however. Furthermore, proper tuning of the system led to substantial variance in performance.

Defining common settings for machine learning algorithms led us to discover the core shortcomings of Hadoop’s MapReduce implementation. We were able to address the most significant of these: the need to broadcast data to map tasks. We also greatly improved the convenience of running MapReduce jobs from within existing Java applications. However, the ability to tie together the distribution of parallel files on the DFS remains an outstanding challenge.

All in all, MapReduce represents a promising direction for future machine learning implementations. When continuing to develop Hadoop and the tools that surround it, we must strive to minimize the compromises between convenience and performance, providing a platform that allows for efficient processing and rapid application development.


6 References

http://mahout.apache.org/

https://chameerawijebandara.wordpress.com/2014/01/03/install-mahout-in-ubuntu-for-beginners/

http://nivirao.blogspot.com/2012/04/installing-apache-mahout-on-ubuntu.html

https://help.ubuntu.com/community/Java

http://mahout.apache.org/developers/buildingmahout.html
