
1/27

732A54 Big Data Analytics
Lecture 10: Machine Learning with MapReduce

Jose M. Peña, IDA, Linköping University, Sweden


2/27

Contents

▸ MapReduce Framework

▸ Machine Learning with MapReduce

▸ Neural Networks
▸ Support Vector Machines
▸ Mixture Models
▸ K-Means

▸ Summary


3/27

Literature

▸ Main sources

▸ Dean, J. and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.

▸ Chu, C.-T. et al. Map-Reduce for Machine Learning on Multicore. In Proceedings of the 19th International Conference on Neural Information Processing Systems, 281-288, 2006.

▸ Additional sources

▸ Dean, J. and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation, 2004.

▸ Yahoo tutorial at https://developer.yahoo.com/hadoop/tutorial/module4.html.

▸ Slides for 732A95 Introduction to Machine Learning.


4/27

MapReduce Framework

▸ Programming framework developed at Google to process large amounts of data by parallelizing computations across a cluster of nodes.
▸ Easy to use, since the parallelization happens automatically.
▸ Easy to speed up by using/adding more nodes to the cluster.
▸ Typical uses at Google:
▸ Large-scale machine learning problems, e.g. clustering documents from Google News.
▸ Extracting properties of web pages, e.g. web access log data.
▸ Large-scale graph computations, e.g. the web link graph.
▸ Statistical machine translation.
▸ Processing satellite images.
▸ Production of the indexing system used for Google's web search engine.
▸ Google replaced it with Cloud Dataflow, since MapReduce could not process the amount of data Google produces.
▸ However, MapReduce is still the processing core of Apache Hadoop, another framework for distributed storage and distributed processing of large datasets on computer clusters.
▸ Moreover, it is a straightforward way to adapt some machine learning algorithms to cope with big data.
▸ Apache Mahout is a project to produce distributed implementations of machine learning algorithms. Many of the available implementations build on Hadoop's MapReduce. However, such implementations are no longer accepted.


5/27

MapReduce Framework

▸ The user only has to implement the following two functions:
▸ Map function:
▸ Input: A pair (in key, in value).
▸ Output: A list list(out key, intermediate value).
▸ Reduce function:
▸ Input: A pair (out key, list(intermediate value)).
▸ Output: A list list(out value).
▸ All intermediate values associated with the same intermediate key are grouped together before passing them to the reduce function.

▸ Example for counting word occurrences in a collection of documents:


6/27

MapReduce Framework
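The figure that filled this slide in the original deck (presumably the word-count example announced on the previous slide) is not reproduced in this transcript. A minimal Python sketch of the two user-supplied functions, assuming that the map output is returned as a list of (word, count) pairs rather than emitted through the framework's own pseudocode, could look as follows:

import re

def word_count_map(in_key, in_value):
    # in_key: document name, in_value: document contents.
    # Emit the pair (word, 1) for every word occurrence in the document.
    return [(word, 1) for word in re.findall(r"\w+", in_value.lower())]

def word_count_reduce(out_key, intermediate_values):
    # out_key: a word, intermediate_values: all counts emitted for that word.
    return [sum(intermediate_values)]

The framework groups all intermediate values by key between the two calls, so the reduce function simply adds up the partial counts for each word.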


7/27

MapReduce Framework

1. Split the input file into M pieces and store them on the local disks of the nodes of the cluster. Start up many copies of the user's program on the nodes.

2. One copy (the master) assigns tasks to the rest of the copies (the workers). To reduce communication, it tries to assign map workers to nodes with input data.


8/27

MapReduce Framework

3. Each map worker processes a piece of input data by passing each key/value pair to the user's map function. The results are buffered in memory.

4. The buffered results are written to local disk. The disk is partitioned into R pieces. The locations of the partitions on disk are passed back to the master so that they can be forwarded to the reduce workers.


9/27

MapReduce Framework

5. The reduce worker reads its partition remotely. This implies a shuffle and sort by key.

6. The reduce worker processes each key using the user's reduce function. The result is written to the global file system.

7. The output of a MapReduce call may be the input to another. Note that we have performed M map tasks and R reduce tasks.
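To make the dataflow of steps 1-7 concrete, here is a minimal single-process sketch that mimics it (split, map, shuffle and sort by key, reduce). The function names and the in-memory "shuffle" are illustrative assumptions; fault tolerance, local disks and the master/worker protocol are ignored.

from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, M=4):
    # 1. Split the input into M pieces (here: round-robin over a list of records).
    pieces = [inputs[i::M] for i in range(M)]
    # 3-4. Run the map tasks and buffer their output.
    intermediate = []
    for piece in pieces:
        for in_key, in_value in piece:
            intermediate.extend(map_fn(in_key, in_value))
    # 5. Shuffle and sort: group intermediate values by key.
    groups = defaultdict(list)
    for out_key, value in intermediate:
        groups[out_key].append(value)
    # 6. Run the reduce tasks and collect the output.
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}

# Example: counting word occurrences (cf. the earlier word-count sketch).
docs = [("d1", "big data big clusters"), ("d2", "big ideas")]
wc_map = lambda k, v: [(w, 1) for w in v.split()]
wc_reduce = lambda k, vs: [sum(vs)]
print(run_mapreduce(docs, wc_map, wc_reduce, M=2))
# {'big': [3], 'clusters': [1], 'data': [1], 'ideas': [1]}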


10/27

MapReduce Framework

▸ MapReduce can emulate any distributed computation, since such a computation consists of nodes that perform local computations and occasionally exchange messages.
▸ Therefore, any distributed computation can be divided into a sequence of MapReduce calls:
▸ First, nodes perform local computations (map), and
▸ then, they exchange messages (reduce).
▸ However, the emulation may be inefficient, since the message exchange relies on external storage, e.g. disk.


11/27

MapReduce Framework

▸ Fault tolerance:
▸ Necessary since thousands of nodes may be used.
▸ The master pings the workers periodically. No answer means failure.
▸ If a worker fails, then its completed and in-progress map tasks are re-executed, since its local disk is inaccessible.
▸ Note the importance of storing several copies (typically 3) of the input data on different nodes.
▸ If a worker fails, then its in-progress reduce task is re-executed. The results of its completed reduce tasks are stored on the global file system and, thus, they are accessible.
▸ To be able to recover from the unlikely event of a master failure, the master periodically saves the state of the different tasks (idle, in-progress, completed) and the identity of the worker for each non-idle task.
▸ Task granularity:
▸ M and R are larger than the number of nodes available.
▸ Large M and R values benefit dynamic load balancing and fast failure recovery.
▸ Too large values may imply too many scheduling decisions and too many output files.
▸ For instance, M = 200000 and R = 5000 for 2000 available nodes.


12/27

Machine Learning with MapReduce: Neural Networks

[Figure: a two-layer feed-forward network with inputs x_0, ..., x_D, hidden units z_0, ..., z_M, outputs y_1, ..., y_K, and weights w^(1) (input to hidden) and w^(2) (hidden to output).]

▸ Activations: $a_j = \sum_i w^{(1)}_{ji} x_i + w^{(1)}_{j0}$
▸ Hidden units and activation function: $z_j = h(a_j)$
▸ Output activations: $a_k = \sum_j w^{(2)}_{kj} z_j + w^{(2)}_{k0}$
▸ Output activation function for regression: $y_k(x) = a_k$
▸ Output activation function for classification: $y_k(x) = \sigma(a_k)$
▸ Sigmoid function: $\sigma(a) = \frac{1}{1+\exp(-a)}$
▸ Two-layer NN:
$$y_k(x) = \sigma\Big(\sum_j w^{(2)}_{kj}\, h\Big(\sum_i w^{(1)}_{ji} x_i + w^{(1)}_{j0}\Big) + w^{(2)}_{k0}\Big)$$
▸ Evaluating the previous expression is known as forward propagation. The NN is said to have a feed-forward architecture.

▸ All the previous is, of course, generalizable to more layers.
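A direct numpy transcription of the forward propagation above; the weight shapes, the bias vectors and the choice of tanh for the hidden activation function h are illustrative assumptions:

import numpy as np

def forward(x, W1, b1, W2, b2, classification=True):
    # Activations and hidden units: a_j = sum_i w^(1)_ji x_i + w^(1)_j0, z_j = h(a_j).
    a = W1 @ x + b1          # W1: M x D, b1: M
    z = np.tanh(a)           # h = tanh, one common choice of activation function
    # Output activations: a_k = sum_j w^(2)_kj z_j + w^(2)_k0.
    a_out = W2 @ z + b2      # W2: K x M, b2: K
    if classification:
        return 1.0 / (1.0 + np.exp(-a_out))   # y_k(x) = sigma(a_k)
    return a_out                              # y_k(x) = a_k for regression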


13/27

Machine Learning with MapReduce: Neural Networks

▸ Consider regressing a K-dimensional continuous random variable on a D-dimensional continuous random variable.

▸ Consider a training set $\{(\mathbf{x}_n, \mathbf{t}_n)\}$ of size $N$. Consider minimizing the error function
$$E(\mathbf{w}^t) = \sum_n E_n(\mathbf{w}^t) = \sum_n \frac{1}{2}\,\|\mathbf{y}(\mathbf{x}_n) - \mathbf{t}_n\|^2$$
▸ The weight space is highly multimodal and, thus, we have to resort to approximate iterative methods to minimize the previous expression.
▸ Batch gradient descent
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla E(\mathbf{w}^t)$$
where $\eta > 0$ is the learning rate, and $\nabla E(\mathbf{w}^t)$ can be computed efficiently thanks to the backpropagation algorithm.
▸ Sequential gradient descent
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla E_n(\mathbf{w}^t)$$
where $n$ is chosen randomly or sequentially.


14/27

Machine Learning with MapReduce: Neural Networks

▸ Sequential gradient descent is less affected by the multimodality problem, as a local minimum of the whole data will generally not be a local minimum of each individual point.
▸ Unfortunately, sequential gradient descent cannot be cast into MapReduce terms: each iteration must wait until the previous iterations are done.
▸ However, each iteration of batch gradient descent can easily be cast into MapReduce terms:
▸ Map function: Compute the gradient for the samples in the piece of input data. Note that this implies forward and backward propagation.
▸ Reduce function: Sum the partial gradients and update $\mathbf{w}$ accordingly.
▸ Note that $1 \leq M \leq N$, whereas $R = 1$.
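A schematic of how these two functions could be laid out for one iteration; the gradient routine is left abstract, since it is whatever backpropagation computes for the chosen network, and all names here are illustrative assumptions:

import numpy as np

def nn_map(chunk, w, backprop_gradient):
    # Map: compute the gradient for the samples in this piece of input data.
    # backprop_gradient(x, t, w) returns dE_n/dw via forward + backward propagation.
    return sum(backprop_gradient(x, t, w) for x, t in chunk)

def nn_reduce(partial_gradients, w, eta=0.01):
    # Reduce (R = 1): sum the partial gradients and update w accordingly.
    grad = np.sum(partial_gradients, axis=0)
    return w - eta * grad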


15/27

Machine Learning with MapReduce: Support Vector Machines

[Figure: the largest-margin separating hyperplane (decision boundaries y = 1, y = 0, y = -1 and the margin), the kernel trick relating input space and feature space, and the soft margin trading errors for margin.]


16/27

Machine Learning with MapReduce: Support Vector Machines

▸ Consider binary classification with input space $\mathbb{R}^D$. Consider a training set $\{(\mathbf{x}_n, t_n)\}$ where $t_n \in \{-1,+1\}$. Consider using the linear model
$$y(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b$$
so that a new point $\mathbf{x}$ is classified according to the sign of $y(\mathbf{x})$.
▸ The optimal separating hyperplane is given by
$$\arg\min_{\mathbf{w},b} \frac{1}{2}\,\|\mathbf{w}\|^2 + C \sum_n \xi_n$$
subject to $t_n y(\mathbf{x}_n) \geq 1 - \xi_n$ and $\xi_n \geq 0$ for all $n$, where
$$\xi_n = \begin{cases} 0 & \text{if } t_n y(\mathbf{x}_n) \geq 1 \\ |t_n - y(\mathbf{x}_n)| & \text{otherwise} \end{cases}$$
are slack variables to penalize (almost-)misclassified points.
▸ We usually work with the dual representation of the problem, in which we maximize
$$\sum_n a_n - \frac{1}{2} \sum_n \sum_m a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$$
subject to $a_n \geq 0$ and $a_n \leq C$ for all $n$.
▸ The reason is that the dual representation makes use of the kernel trick, i.e. it allows working in a more convenient feature space without constructing it.
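As a side note, the dual objective above is just a quadratic form in the $a_n$. A small numpy sketch (the Gaussian kernel and all names here are illustrative assumptions, not part of the slides) evaluates it as:

import numpy as np

def rbf_kernel(X, gamma=1.0):
    # k(x_n, x_m) = exp(-gamma * ||x_n - x_m||^2), computed for all pairs of rows of X.
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

def dual_objective(a, t, K):
    # sum_n a_n - 1/2 sum_n sum_m a_n a_m t_n t_m k(x_n, x_m)
    at = a * t
    return np.sum(a) - 0.5 * at @ K @ at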


17/27

Machine Learning with MapReduce: Support Vector Machines

▸ Assume that we are interested in linear support vector machines, i.e. $y(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b = \mathbf{w}^T \mathbf{x} + b$. Then, we can work directly with the primal representation.
▸ We can even consider a quadratic penalty for (almost-)misclassified points, in which case the optimal separating hyperplane is given by
$$\arg\min_{\mathbf{w},b} \frac{1}{2}\,\|\mathbf{w}\|^2 + C \sum_n \xi_n^2 = \arg\min_{\mathbf{w}} \frac{1}{2}\,\|\mathbf{w}\|^2 + C \sum_{n \in E} (\mathbf{w}^T \mathbf{x}_n - t_n)^2$$
where $n \in E$ if and only if $t_n y(\mathbf{x}_n) < 1$.
▸ Note that the previous expression is a quadratic function with linear inequality constraints and, thus, it is concave up (i.e. convex) and, thus, "easy" to minimize.
▸ For instance, we can again use batch gradient descent. The gradient is now given by
$$\mathbf{w} + 2C \sum_{n \in E} (\mathbf{w}^T \mathbf{x}_n - t_n)\, \mathbf{x}_n$$
▸ Again, each iteration of batch gradient descent can easily be cast into MapReduce terms:
▸ Map function: Compute the gradient for the samples in the piece of input data.
▸ Reduce function: Sum the partial gradients and update $\mathbf{w}$ accordingly.
▸ Note that $1 \leq M \leq N$, whereas $R = 1$.
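A sketch of the corresponding map and reduce functions for the linear SVM with quadratic penalty; the bias term is omitted, as in the second minimization above, and the names are illustrative assumptions:

import numpy as np

def svm_map(X_chunk, t_chunk, w, C=1.0):
    # Map: partial gradient of the penalty term over this piece of input data,
    # 2C * sum_{n in E} (w^T x_n - t_n) x_n with E = {n : t_n y(x_n) < 1}.
    E = t_chunk * (X_chunk @ w) < 1
    return 2 * C * X_chunk[E].T @ (X_chunk[E] @ w - t_chunk[E])

def svm_reduce(partial_gradients, w, eta=0.01):
    # Reduce (R = 1): add the regularizer's gradient w once, sum the partials, update.
    grad = w + np.sum(partial_gradients, axis=0)
    return w - eta * grad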


18/27

Machine Learning with MapReduce: Mixture Models

▸ Sometimes the data do not follow any known probability distribution but a mixture of known distributions such as
$$p(\mathbf{x}) = \sum_{k=1}^K p(k)\, p(\mathbf{x}|k)$$
where $p(\mathbf{x}|k)$ are called mixture components, and $p(k)$ are called mixing coefficients, usually denoted by $\pi_k$.
▸ Mixture of multivariate Gaussian distributions:
$$p(\mathbf{x}) = \sum_k \pi_k \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \quad \text{and} \quad \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}_k|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}-\boldsymbol{\mu}_k)}$$


19/27

Machine Learning with MapReduce: Mixture Models

▸ Mixture of multivariate Bernoulli distributions:
$$p(\mathbf{x}) = \sum_k \pi_k \mathrm{Bern}(\mathbf{x}|\boldsymbol{\mu}_k) \quad \text{where} \quad \mathrm{Bern}(\mathbf{x}|\boldsymbol{\mu}_k) = \prod_i \mu_{ki}^{x_i} (1-\mu_{ki})^{(1-x_i)}$$


20/27

Machine Learning with MapReduce: Mixture Models

▸ Given a sample $\{\mathbf{x}_n\}$ of size $N$ from a mixture of multivariate Bernoulli distributions, the expected log likelihood function is maximized when
$$\pi_k^{ML} = \frac{\sum_n p(z_{nk}|\mathbf{x}_n, \boldsymbol{\mu}, \boldsymbol{\pi})}{N} \qquad \mu_{ki}^{ML} = \frac{\sum_n x_{ni}\, p(z_{nk}|\mathbf{x}_n, \boldsymbol{\mu}, \boldsymbol{\pi})}{\sum_n p(z_{nk}|\mathbf{x}_n, \boldsymbol{\mu}, \boldsymbol{\pi})}$$
where $\mathbf{z}_n$ is a $K$-dimensional binary vector indicating (fractional) component memberships.
▸ This is not a closed-form solution, but it suggests the following algorithm.

EM algorithm

Set $\boldsymbol{\pi}$ and $\boldsymbol{\mu}$ to some initial values
Repeat until $\boldsymbol{\pi}$ and $\boldsymbol{\mu}$ do not change
    Compute $p(z_{nk}|\mathbf{x}_n, \boldsymbol{\mu}, \boldsymbol{\pi})$ for all $n$ /* E step */
    Set $\pi_k$ to $\pi_k^{ML}$, and $\mu_{ki}$ to $\mu_{ki}^{ML}$ for all $k$ and $i$ /* M step */


21/27

Machine Learning with MapReduce: Mixture Models

▸ Each iteration of the EM algorithm can easily be cast into MapReduce terms:
▸ Map function: Compute
$$\sum_{n \in PID} p(z_{nk}|\mathbf{x}_n, \boldsymbol{\mu}, \boldsymbol{\pi}) \qquad (1)$$
and
$$\sum_{n \in PID} x_{ni}\, p(z_{nk}|\mathbf{x}_n, \boldsymbol{\mu}, \boldsymbol{\pi}) \qquad (2)$$
where PID is the piece of input data.
▸ Reduce function: Sum up the results (1) of the map tasks and divide the sum by $N$. Sum up the results (2) of the map tasks and divide the sum by the sum of the results (1).
▸ Note that $1 \leq M \leq N$, whereas $R = 1$.
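A compact numpy sketch of one EM iteration cast in these terms, for the Bernoulli mixture; the function and variable names are illustrative assumptions, and the parameters $\mu_{ki}$ are assumed to lie strictly in (0, 1):

import numpy as np

def bernoulli_responsibilities(X_chunk, mu, pi):
    # E step on one piece of input data: p(z_nk | x_n, mu, pi), row-normalized.
    # log Bern(x_n | mu_k) = sum_i x_ni log mu_ki + (1 - x_ni) log(1 - mu_ki)
    log_p = X_chunk @ np.log(mu).T + (1 - X_chunk) @ np.log(1 - mu).T + np.log(pi)
    log_p -= log_p.max(axis=1, keepdims=True)        # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)          # shape: N_chunk x K

def em_map(X_chunk, mu, pi):
    # Emit the two partial sums (1) and (2) for this piece of input data.
    r = bernoulli_responsibilities(X_chunk, mu, pi)
    return r.sum(axis=0), r.T @ X_chunk              # shapes: (K,), (K, D)

def em_reduce(partial_sums, N):
    # M step (R = 1): combine the partial sums from all map tasks.
    sum1 = sum(s1 for s1, _ in partial_sums)         # sum of (1), shape (K,)
    sum2 = sum(s2 for _, s2 in partial_sums)         # sum of (2), shape (K, D)
    pi_new = sum1 / N
    mu_new = sum2 / sum1[:, None]
    return pi_new, mu_new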


22/27

Machine Learning with MapReduce: K-Means Algorithm

▸ Recall that
$$\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}_k|^{1/2}} \exp\Big(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}-\boldsymbol{\mu}_k)\Big)$$
▸ Assume that $\boldsymbol{\Sigma}_k = \varepsilon \mathbf{I}$, where $\varepsilon$ is a variance parameter and $\mathbf{I}$ is the identity matrix. Then,
$$\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}_k|^{1/2}} \exp\Big(-\frac{1}{2\varepsilon}\,\|\mathbf{x}-\boldsymbol{\mu}_k\|^2\Big)$$
$$p(k|\mathbf{x}, \boldsymbol{\mu}, \boldsymbol{\pi}) = \frac{\pi_k \exp\big(-\frac{1}{2\varepsilon}\,\|\mathbf{x}-\boldsymbol{\mu}_k\|^2\big)}{\sum_k \pi_k \exp\big(-\frac{1}{2\varepsilon}\,\|\mathbf{x}-\boldsymbol{\mu}_k\|^2\big)}$$
▸ As $\varepsilon \to 0$, the smaller $\|\mathbf{x}-\boldsymbol{\mu}_k\|^2$ is, the slower $\exp(-\frac{1}{2\varepsilon}\,\|\mathbf{x}-\boldsymbol{\mu}_k\|^2)$ goes to 0.
▸ As $\varepsilon \to 0$, individuals are hard-assigned (i.e. with probability 1) to the population with the closest mean. This clustering technique is known as the K-means algorithm.
▸ Note that $\boldsymbol{\pi}$ and $\boldsymbol{\Sigma}$ play no role in the K-means algorithm whereas, in each iteration, $\boldsymbol{\mu}_k$ is updated to the average of the individuals assigned to population $k$.
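A tiny numeric illustration of this limit, using a hypothetical one-dimensional example with two means (the printed values are only indicative):

import numpy as np

mu = np.array([0.0, 1.0])     # two population means
pi = np.array([0.5, 0.5])     # equal mixing coefficients
x = 0.3                       # a point closer to the first mean

for eps in [1.0, 0.1, 0.01]:
    w = pi * np.exp(-(x - mu) ** 2 / (2 * eps))
    print(eps, w / w.sum())
# eps = 1.0  -> roughly [0.55, 0.45]
# eps = 0.1  -> roughly [0.88, 0.12]
# eps = 0.01 -> essentially [1.0, 0.0]: a hard assignment to the closest mean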


23/27

Machine Learning with MapReduce: K-Means Algorithm

[Figure: side-by-side illustration of the K-means algorithm and the EM algorithm.]


24/27

Machine Learning with MapReduce: K-Means Algorithm

▸ Each iteration of the K-means algorithm can easily be cast into MapReduce terms:
▸ Map function: Choose the population with the closest mean for each sample in the piece of input data.
▸ Reduce function: Recalculate the population means from the results of the map tasks.
▸ Note that $1 \leq M \leq N$, whereas $R = 1$.
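A sketch of these two functions in numpy; the names and the handling of empty populations are illustrative assumptions:

import numpy as np

def kmeans_map(X_chunk, means):
    # Assign each sample in this piece of input data to the closest mean and
    # emit per-population partial sums and counts.
    d2 = ((X_chunk[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    closest = d2.argmin(axis=1)                 # index of the closest mean
    K, D = means.shape
    sums, counts = np.zeros((K, D)), np.zeros(K)
    for k in range(K):
        members = X_chunk[closest == k]
        sums[k] = members.sum(axis=0)
        counts[k] = len(members)
    return sums, counts

def kmeans_reduce(partials, old_means):
    # Recalculate the population means from the partial sums of all map tasks.
    sums = sum(s for s, _ in partials)
    counts = sum(c for _, c in partials)
    new_means = old_means.copy()
    nonempty = counts > 0                       # keep the old mean for empty populations
    new_means[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return new_means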


25/27

Machine Learning with MapReduce


26/27

Machine Learning with MapReduce


27/27

Summary

▸ MapReduce is a framework to process large datasets by parallelizing computations.
▸ The user only has to specify the map and reduce functions, and parallelization happens automatically.
▸ Many machine learning algorithms (e.g. SVMs, NNs, MMs, K-means) can easily be reformulated in terms of such functions.
▸ This does not apply to algorithms based on sequential gradient descent.
▸ Moreover, MapReduce is inefficient for iterative tasks on the same dataset: each iteration is a MapReduce call that loads the data anew from disk.
▸ Such iterative tasks are common in many machine learning algorithms, e.g. gradient descent, the backpropagation algorithm, the EM algorithm, K-means.
▸ Solution: the Spark framework, covered in the next lecture.