Parallel Machine Learning

Janani Chakkaradhari

Information Technology for Business Intelligence

Technische Universität Berlin

February 13, 2014

Abstract

Scalability is an essential factor in the performance of any computational algorithm. In the Big Data era, gathering large amounts of data has become easy, but analyzing that data with existing Machine Learning (ML) algorithms is often not feasible, and the algorithms perform poorly. The reason is that the computational logic of these algorithms was designed to run sequentially. MapReduce [1] offers a way to process billions of records efficiently. In this report we discuss the basic building block of the computation behind ML algorithms, describe two different attempts to parallelize ML algorithms using MapReduce, and briefly describe the overhead involved in parallelizing ML algorithms.

1 Introduction

The significance of Machine Learning algorithms is widely known, and their adoption in various applications brings benefits to business as well as to the research community. Traditional ML algorithms were designed on the assumption that the data fits in memory. On the other hand, the current distributed infrastructure of Information Systems (IS) lets the computerized society easily access and generate data in almost every action of day-to-day life. This perpetual increase in data degrades the performance of ML algorithms that had been shown to produce fast and accurate results on smaller datasets, which in turn becomes the cause of the “curse of modularity” [9].

With the advent of the MapReduce programming model, voluminous data can be handled efficiently in parallel, because execution follows a divide-and-conquer methodology. “Learning can become limited by computation time and not by data volume with help of MapReduce and large clusters of machines” [8], which implies that ML algorithms have to be redesigned in order to execute on a parallel architecture.

Parallelizing ML algorithms using the MapReduce model can thus increase the speed of computation, and earlier work on this topic has reported improved performance. This report presents a brief background study on the use of Linear Algebra in ML in Section 2, followed by an overview of a novel approach for parallelizing the Stochastic Gradient Descent algorithm for Matrix Factorization [2] in Section 3, and a brief summary of declarative ML, an attempt to provide a declarative way of executing some ML algorithms and linear algebra primitives on Hadoop using a system called SystemML [3], in Section 4.

2 Computational Engine for Machine Learning

Mathematics and computer science are like the two rails of a train track: they always run together to ensure a good journey for real-world users. Linear algebra has a prominent role in ML. Transforming the problem space into linear functions is one of the elementary approaches used in predictive algorithms, and matrices are the means of representing linear functions. In other words, the interactions between two entities of a system can be represented in a two-dimensional form known as a matrix. The elements of the matrix represent the magnitude of the interactions between two finite sets of objects, also known as dyadic data [4]. Analyzing the system with matrix techniques allows one to predict the effect of individual interactions on the overall system. Some of the prominent applications of linear algebra in ML are listed below:

• Singular Value Decomposition (SVD) is well known for its applications in image compression, in detecting oscillations or damage in structures such as bridges during the design phase, and many more.

• Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are used as a feature extraction step before classification.

• Eigenvalues and eigenvectors have proven results in the PageRank algorithm.

• Analyses based on dyads, such as topic modeling, keyword search and recommender systems, are based on the Non-negative Matrix Factorization technique [6].
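As a small illustration of the eigenvector item above, the following sketch runs power iteration on a column-stochastic link matrix. The three-page link graph is invented for illustration, and the damping factor that PageRank normally adds is omitted to keep the example short, so this is a simplified sketch rather than the full algorithm.

```python
def power_iteration(matrix, iterations=100):
    """Approximate the dominant eigenvector of a column-stochastic matrix."""
    n = len(matrix)
    v = [1.0 / n] * n  # start from a uniform rank vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        v = [x / total for x in nxt]  # keep the entries summing to 1
    return v

# Hypothetical link graph: page 0 links to 1 and 2, page 1 links to 2,
# page 2 links to 0. Column j spreads the rank of page j over its out-links.
LINKS = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]

ranks = power_iteration(LINKS)
```

For this graph the stationary vector can be checked by hand: pages 0 and 2 each hold 0.4 of the rank and page 1 holds 0.2.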

3 Large Scale Matrix Factorization with DSGD

This section gives an overview of the Distributed Stochastic Gradient Descent (DSGD) algorithm, preceded by a quick introduction to the uses of Matrix Factorization and Stochastic Gradient Descent and a brief review of how Matrix Factorization is optimized with Stochastic Gradient Descent.

3.1 Matrix Factorization

Matrix Factorization is mainly used to extract interaction structure from dyadic data [6]. The interaction structure includes the following [4]:

• Co-occurrence

• Strength of preference or the association

• Word clustering, word sense disambiguation and thesaurus construction in text-based information retrieval

• Modeling of preference and consumption behavior

• The dyad in computer vision applications represents the feature observed at a particular image location.

3.2 Stochastic Gradient Descent (SGD)

Gradient descent is widely applied to optimization problems. It is predominantly used to minimize the cost function of ML algorithms such as linear regression, where the weight vector (parameter vector) is determined by minimizing the average of the squared errors between the predictions and the actual values in the training set [7].

One main drawback of gradient descent is that it requires the entire training set to compute the average squared error in each step of updating the parameter vector, and it repeats this process until the parameter vector converges. This slows the algorithm down. This variant is also termed Batch Gradient Descent.

In contrast, Stochastic Gradient Descent takes a single, randomly chosen training example at a time, updates the parameter vector with respect to that example, and repeats the process until it converges. This eliminates the need to look at the entire data set in each step; the entire training set is scanned only across repeated passes of the algorithm.
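The difference between the two variants can be sketched for one-dimensional linear regression as follows. This is a minimal illustration, not taken from [7]: the function names, the learning rate, the number of passes and the noise-free toy data on the line y = 2x + 1 are all assumptions chosen so that both variants recover the same parameters.

```python
import random

def batch_gradient_step(w, b, data, lr):
    """Batch variant: one update uses the gradient averaged over all examples."""
    n = len(data)
    grad_w = sum((w * x + b - y) * x for x, y in data) / n
    grad_b = sum((w * x + b - y) for x, y in data) / n
    return w - lr * grad_w, b - lr * grad_b

def sgd_epoch(w, b, data, lr, rng):
    """Stochastic variant: visit examples in random order, update after each one."""
    order = list(data)
    rng.shuffle(order)
    for x, y in order:
        err = w * x + b - y
        w -= lr * err * x
        b -= lr * err
    return w, b

def fit(data, lr=0.1, epochs=3000, stochastic=True, seed=0):
    """Fit y = w * x + b by squared-error minimization."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        if stochastic:
            w, b = sgd_epoch(w, b, data, lr, rng)
        else:
            w, b = batch_gradient_step(w, b, data, lr)
    return w, b

# Toy training set: eleven points on the line y = 2x + 1.
DATA = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(11)]
```

Calling `fit(DATA, stochastic=False)` touches all eleven examples for every single update, whereas `fit(DATA, stochastic=True)` performs eleven updates per pass over the same data.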

3.3 Stochastic Gradient Descent for Matrix Factorization

Matrix Factorization helps to reconstruct the original matrix from a partially observed matrix using an approximation technique. For example, in the Netflix recommendation problem [5], the rows represent users and the columns represent movies. The matrix is partially filled with the ratings users have given to movies. From the existing rating values, Matrix Factorization tries to predict the missing values. In its simplest form, this is done by associating some numbers (factors) with each user and each movie such that the product of the two is as close as possible to the original rating.

The discrepancy between the original input matrix and the product of the factors is the cost function, which we try to minimize in order to obtain the most appropriate factors. One way to do this is to employ the Stochastic Gradient Descent algorithm, which usually performs well in sequential execution. Since the SGD approximation is noisy, the cost function here includes regularization and other information along with the prediction error. SGD tries to minimize the sum of the losses over all entries of the matrix. SGD works as follows [2]:

• Step 1: Take a random entry from the training set

• Step 2: Evaluate the loss function

• Step 3: Update the parameters

• Step 4: Repeat Steps 1 to 3 for all the entries in the matrix
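The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from [2]: the loss is taken to be the squared error plus an L2 regularization term, and the rank, learning rate, regularization weight, number of passes and toy ratings are all invented for the example.

```python
import random

def sgd_mf(ratings, n_users, n_items, rank=2, lr=0.02, reg=0.01, epochs=500, seed=0):
    """Factor a partially observed matrix given as {(user, item): rating}."""
    rng = random.Random(seed)
    W = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_users)]
    H = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_items)]
    entries = list(ratings.items())
    for _ in range(epochs):
        rng.shuffle(entries)                      # Step 1: random entry order
        for (i, j), r in entries:                 # Step 4: cover all entries
            pred = sum(W[i][k] * H[j][k] for k in range(rank))
            err = r - pred                        # Step 2: local prediction error
            for k in range(rank):                 # Step 3: update both factors
                w_ik, h_jk = W[i][k], H[j][k]
                W[i][k] += lr * (err * h_jk - reg * w_ik)
                H[j][k] += lr * (err * w_ik - reg * h_jk)
    return W, H

def rmse(ratings, W, H):
    """Root mean squared error over the observed entries only."""
    total = 0.0
    for (i, j), r in ratings.items():
        pred = sum(w * h for w, h in zip(W[i], H[j]))
        total += (r - pred) ** 2
    return (total / len(ratings)) ** 0.5

# Hypothetical ratings of 4 users for 3 movies; missing pairs are unobserved.
RATINGS = {
    (0, 0): 1.0, (0, 1): 2.0,
    (1, 1): 4.0, (1, 2): 5.0,
    (2, 0): 1.5, (2, 2): 3.75,
    (3, 0): 2.0, (3, 1): 4.0,
}
```

Note that each update touches exactly one row of W and one row of H, which is the property the parallel variant discussed next relies on.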

We cannot run this algorithm directly in parallel using MapReduce, for the following reason. Suppose each mapper runs SGD on a subset of the large matrix: it reads the current row and current column of the subset, evaluates the local loss function and updates the parameters (i.e. the factors of the rows and columns) for the corresponding matrix subset. When SGD runs in parallel, the algorithm may at the same time be executed on another subset of the matrix that is dependent on the first (the same column but a different row). The second mapper then inevitably reads values that are being updated by the first mapper at the same time. This conflict prevents the algorithm from running unchanged on a parallel architecture.

As described by Gemulla et al. [2], not all subsets of the matrix are dependent. In most cases the subsets are completely independent of each other, so that it is possible to run SGD by locking the rows and columns of a subset. This idea forms the basis for parallelized SGD.

3.4 Distributed SGD for Matrix Factorization (DSGD)

DSGD exploits the concept of independent rows and columns. Suppose we have d nodes in the cluster. We split the input matrix (the training set of known ratings) into d × d smaller matrices (blocks) and distribute the blocks over the d nodes such that each node holds an entire row of blocks, as shown in Figure 1.

Figure 1: Example Stratum of 3 Cluster nodes

A set of interchangeable sub-matrices is called a stratum and represents a partition of the underlying matrix dataset. In the paper [2], stratification is performed by permutation, so that d nodes have d! possible combinations of independent blocks. For example, 3 nodes have 6 possible strata, and these 6 strata form a single sequence of strata. The DSGD algorithm works as follows, assuming there are d nodes available, Z is the training set input matrix, and W and H are the factor matrices of the input matrix.
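The permutation-based stratification can be sketched as follows. The function names are hypothetical; each stratum is written as a list of (row block, column block) pairs, and the second helper shows, as an aside, that d cyclic shifts are already enough to visit every block exactly once.

```python
from itertools import permutations

def all_strata(d):
    """Enumerate all d! strata of a d x d blocking.

    Each permutation assigns a distinct column block to every row block,
    so no two blocks in a stratum share rows or columns.
    """
    return [[(row, col) for row, col in enumerate(perm)]
            for perm in permutations(range(d))]

def covering_strata(d):
    """d strata built from cyclic shifts; together they cover all d * d blocks."""
    return [[(b, (b + shift) % d) for b in range(d)] for shift in range(d)]

strata_for_three_nodes = all_strata(3)
```

For d = 3 the first stratum returned is the diagonal `[(0, 0), (1, 1), (2, 2)]`, and the list has 3! = 6 entries, matching the count given in the text.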

• Step 1: Divide the input matrix Z into d × d blocks and distribute them over the cluster. The parameters W and H are distributed equally over d blocks of rows and columns respectively, such that W is blocked as d × 1 and H as 1 × d. Compute the sequence of strata for the input blocks using permutations. For each stratum in the sequence, do Step 2 and Step 3.

• Step 2: Select from the sequence of strata (all possible combinations of blocks) a stratum whose blocks are independent, for example the blocks along the diagonal (the red boxes shown in the figure).

• Step 3: Run SGD on the selected blocks in parallel to find the local minimum of the loss function. Sum up the local losses computed at each block and update the corresponding factor matrices W and H.
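A single-process sketch of these steps is given below. It is an illustration under several assumptions rather than the algorithm of [2] verbatim: the blocks of a stratum are processed in a plain loop that stands in for d parallel workers, the factors are updated in place instead of being merged after each stratum, strata are taken from cyclic shifts, and the loss, hyperparameters and toy data are invented.

```python
import random

def sgd_update(W, H, i, j, r, lr, reg):
    """One SGD step on a single observed entry Z[i][j] = r."""
    rank = len(W[i])
    err = r - sum(W[i][k] * H[j][k] for k in range(rank))
    for k in range(rank):
        w_ik, h_jk = W[i][k], H[j][k]
        W[i][k] += lr * (err * h_jk - reg * w_ik)
        H[j][k] += lr * (err * w_ik - reg * h_jk)

def rmse(ratings, W, H):
    """Root mean squared error over the observed entries only."""
    total = 0.0
    for (i, j), r in ratings.items():
        total += (r - sum(w * h for w, h in zip(W[i], H[j]))) ** 2
    return (total / len(ratings)) ** 0.5

def dsgd(ratings, n_rows, n_cols, d=2, rank=2, lr=0.02, reg=0.01, epochs=300, seed=0):
    rng = random.Random(seed)
    W = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_cols)]

    # Step 1: split the observed entries of Z into d x d blocks.
    blocks = {}
    for (i, j), r in ratings.items():
        key = (i * d // n_rows, j * d // n_cols)
        blocks.setdefault(key, []).append((i, j, r))

    for _ in range(epochs):
        shifts = list(range(d))
        rng.shuffle(shifts)
        for shift in shifts:
            # Step 2: a stratum of d blocks that share no rows or columns.
            stratum = [(b, (b + shift) % d) for b in range(d)]
            # Step 3: the blocks touch disjoint rows of W and H, so each
            # iteration of this loop could run on its own node.
            for key in stratum:
                entries = blocks.get(key, [])
                rng.shuffle(entries)
                for i, j, r in entries:
                    sgd_update(W, H, i, j, r, lr, reg)
    return W, H

# Hypothetical 4 x 4 rank-1 ratings with some entries left unobserved.
U, V = [1.0, 2.0, 1.5, 2.0], [1.0, 2.0, 2.5, 1.5]
Z = {(i, j): U[i] * V[j] for i in range(4) for j in range(4) if (i + j) % 3 != 0}
```

Because two blocks of the same stratum never read or write the same factor rows, the conflict described in Section 3.3 cannot occur inside a stratum; synchronization is only needed between strata.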

This is how DSGD runs the SGD algorithm in a distributed manner within a stratum. DSGD outperforms the ALS (Alternating Least Squares) method for matrix factorization [2]. Because DSGD avoids averaging over loss functions when executed in parallel, the algorithm is simpler and more versatile.

4 Declarative Machine Learning: SystemML

The overhead of parallelizing ML algorithms can be understood from the simple SGD algorithm discussed in the previous section. It makes clear that researchers have to analyze each sequentially efficient ML algorithm carefully to make it parallel and executable in the MapReduce programming model. The cost of implementing algorithms as MapReduce jobs is high, and for better performance the same algorithm sometimes has to be hand-tuned. Hence there is little room left for optimizing the MapReduce jobs themselves. For example, in the case of matrix multiplication, the order in which the multiplications are executed has a large impact on performance [3]. Researchers from the IBM Almaden and Watson research centers have proposed a new approach to parallelizing ML algorithms that also takes optimization into account; it is called SystemML.
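The effect of multiplication order can be made concrete by counting scalar multiplications for a three-matrix product. The shapes below are invented for illustration and the cost model is the standard rows × inner × columns count for dense matrices; it is not the cost model used by SystemML.

```python
def multiply_cost(a_shape, b_shape):
    """Return (scalar multiplications, result shape) for a dense product A * B."""
    rows, inner = a_shape
    inner_b, cols = b_shape
    if inner != inner_b:
        raise ValueError("inner dimensions do not match")
    return rows * inner * cols, (rows, cols)

# Hypothetical shapes: A is 1000 x 10, B is 10 x 1000, C is 1000 x 10.
A, B, C = (1000, 10), (10, 1000), (1000, 10)

# Order 1: (A * B) * C builds a 1000 x 1000 intermediate matrix.
cost_ab, ab_shape = multiply_cost(A, B)
cost_ab_c, _ = multiply_cost(ab_shape, C)
left_first = cost_ab + cost_ab_c

# Order 2: A * (B * C) builds only a 10 x 10 intermediate matrix.
cost_bc, bc_shape = multiply_cost(B, C)
cost_a_bc, _ = multiply_cost(A, bc_shape)
right_first = cost_bc + cost_a_bc
```

Here `left_first` is 20,000,000 multiplications and `right_first` is 200,000, a factor of 100 for the same mathematical result, which is the kind of decision a declarative system can make on the user's behalf.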

SystemML is analogous to HiveQL, developed by Facebook for executing data warehouse queries on large clusters, where the queries are converted to MapReduce jobs that are executed on Hadoop by the HiveQL engine. Similarly, SystemML provides a declarative platform for expressing ML algorithms and linear algebra primitives and converts the abstract representation into executable MapReduce jobs on Hadoop.

4.1 Application areas of SystemML

In SystemML, ML algorithms are expressed in a high-level language called Declarative Machine Learning (DML), which is comparable to R. DML supports operations such as the transpose of a matrix, matrix multiplication, iterative algorithms using “for” and “while” constructs, and so on. This lets users focus on writing scripts that state what computation to perform rather than how to express it. SystemML is highly scalable and tunes performance efficiently. It is used in different fields such as predictive modeling, recommender systems, and search analysis.

4.2 System Architecture of SystemML

SystemML takes the DML script as input, passes it through several components [3], and produces a parsed representation of the initial script. It supports built-in data types for representing matrices and scalars. The first step in SystemML is to identify the statement blocks, based on the constructs that break the sequential flow of the DML program. For each statement block, it performs the steps described below.

4.3 High level Operator (HOP)

The HOP component takes the following input and produces the following output.

Input: Parsed statement blocks

Action: The computation in each statement block instantiates one HOP Dag (Directed Acyclic Graph). A HOP Dag represents the basic operations on matrices and scalars, such as an operation or a transformation.

Optimizations: Algebraic rewrites, selection of the physical representation for intermediate matrices, and cost-based optimizations

Output: High level execution plan (HOP Dags) representing dataflow

4.4 Low level Operator (LOP)

The LOP component follows the HOP component; its input and output are as follows.

Input: High level execution plan (HOP Dags)

Action: HOP Dags are converted into low-level physical plans (LOP Dags) that can be executed as MapReduce jobs. HOP Dags are processed from bottom to top. Each HOP Dag is converted into one or more LOP Dags. The input and output formats of each LOP are key-value pairs. Since a single computation leads to multiple LOPs, SystemML tries to combine these LOPs to fit into a single MapReduce job. This is implemented by a novel algorithm named piggybacking, which reduces the number of scans performed on the input data during the execution of MR jobs. The algorithm is described in the Piggybacking section.

Output: Low level execution plan (LOP Dags)

4.5 Runtime

The runtime makes sure that the input matrices are represented as key-value pairs, disregarding the cells without a value; because the matrices are inherently sparse, this reduces the size of the input matrix representation. SystemML collects local sparsity information by applying a blocking operation to the input matrix. The input matrix is divided into smaller matrices called blocks; each block is represented as a key-value pair in which the key is the block id and the value holds the cells of the block, along with a parameter indicating whether the block is dense or sparse. The block size has a major impact on the number of key-value pairs generated by the runtime [3].
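The blocked key-value representation can be sketched as follows. This is an illustration of the idea only, not SystemML's actual data format: the function names, the dictionary layout and the 50% fill threshold for labeling a block dense are assumptions.

```python
def to_blocks(cells, block_size):
    """Group the non-empty cells {(i, j): value} of a sparse matrix into blocks.

    Returns {(block_row, block_col): {(local_i, local_j): value}}, i.e. one
    key-value pair per non-empty block instead of one per non-empty cell.
    """
    blocks = {}
    for (i, j), value in cells.items():
        block_id = (i // block_size, j // block_size)
        local = (i % block_size, j % block_size)
        blocks.setdefault(block_id, {})[local] = value
    return blocks

def label_blocks(blocks, block_size, dense_threshold=0.5):
    """Mark each block dense or sparse from its fill ratio (threshold is assumed)."""
    capacity = block_size * block_size
    return {
        block_id: "dense" if len(values) / capacity >= dense_threshold else "sparse"
        for block_id, values in blocks.items()
    }

# A 4 x 4 matrix with four non-empty cells; all other cells are dropped.
CELLS = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (3, 3): 4.0}
BLOCKS = to_blocks(CELLS, block_size=2)
LABELS = label_blocks(BLOCKS, block_size=2)
```

With a block size of 2 the four cell-level pairs collapse into two block-level pairs, one three-quarters full and one a quarter full; a larger block size would reduce the number of pairs further at the price of larger values per pair.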

Generic MapReduce Job (G-MR) is the main execution engine in SystemML and is instantiated by the piggybacking algorithm (multiple LOPs inside a single MR job).

The Control Module coordinates the execution of MapReduce jobs and is involved in computations such as arithmetic operations, predictive evaluations and so on. Multiple optimizations are performed in the runtime component, decided dynamically based on data characteristics.

4.6 Piggybacking

This algorithm packages multiple LOPs in SystemML into a single MapReduce job by considering the execution location of each LOP at runtime. The execution location identifies whether a LOP can be executed in Map or in Reduce, or whether it requires both Map and Reduce for complete execution. Figure 2 shows the list of different LOPs and their corresponding execution locations. For example, the group LOP has to be executed in both the Map and the Reduce phase, and so it is marked as MapAndReduce.

We consider the example in Figure 3 to lay out the logic behind the piggybacking algorithm. The left part of the diagram represents the LOP Dag for the multiplication of a matrix W with its transpose. LOP Dags are processed bottom-up. The algorithm starts by sorting the LOPs in topological order; the result of the sort is shown in the center of the diagram. The algorithm works iteratively, creating a new MR job at the beginning of each iteration. LOPs are assigned to the MR job in the following order: first the LOPs that require only the Map phase (indicated by the Map or Reduce location in Figure 2), then the LOPs that need both the Map and the Reduce phase (MapAndReduce), and finally the LOPs that require only the Reduce phase. The algorithm makes sure that another descendant with execution location MapAndReduce is not assigned to the same job.

Figure 2: Execution locations of LOP from [3]

Figure 3: Example Piggybacking

In our example, since the Data W and Transform LOPs have the Map or Reduce location, they are assigned to the Map phase of the first MR job. mmcj is the first LOP that spans the Map and Reduce phases, so it is assigned to both the Map and Reduce phases of the first MR job. Since the first MR job already has a LOP with location MapAndReduce, the Group LOP, which has the same execution location, cannot be assigned to the first MR job. Hence the iteration ends, and the next iteration starts by instantiating a second MR job. Finally, the Group and Aggregation LOPs are assigned to this second MR job, which completes the piggybacking algorithm in this example.
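The walk-through above can be reproduced with a simplified greedy sketch. It is a reconstruction from the rules stated in this section, not the algorithm of [3] in full: the location names, the tuple layout, and the exact wiring of the example Dag (which LOP feeds which) are assumptions made for illustration.

```python
def piggyback(lops):
    """Pack LOPs into MapReduce jobs.

    lops: list of (name, location, inputs) in topological order, where
    location is "Map", "Reduce", "MapOrReduce" or "MapAndReduce".
    Returns a list of jobs, each a list of (name, phase) pairs.
    """
    done = set()              # LOPs completed by earlier jobs
    remaining = list(lops)
    jobs = []
    while remaining:
        job = {}              # name -> phase inside the current job
        has_map_and_reduce = False
        pending = []
        for name, location, inputs in remaining:
            ready = all(i in done or i in job for i in inputs)
            # True if an input is produced after this job's shuffle.
            after_shuffle = any(job.get(i) in ("map+reduce", "reduce") for i in inputs)
            if not ready:
                pending.append((name, location, inputs))
            elif location == "MapAndReduce":
                if has_map_and_reduce or after_shuffle:
                    pending.append((name, location, inputs))  # needs a new job
                else:
                    job[name] = "map+reduce"
                    has_map_and_reduce = True
            elif location == "Map":
                if after_shuffle:
                    pending.append((name, location, inputs))
                else:
                    job[name] = "map"
            elif location == "Reduce":
                job[name] = "reduce"
            else:  # MapOrReduce runs wherever its inputs become available
                job[name] = "reduce" if after_shuffle else "map"
        if not job:
            raise ValueError("no LOP could be scheduled; check the input order")
        jobs.append(list(job.items()))
        done.update(job)
        remaining = pending
    return jobs

# Assumed LOP Dag for multiplying W with its transpose, in topological order.
EXAMPLE = [
    ("data W", "MapOrReduce", []),
    ("transform", "MapOrReduce", ["data W"]),
    ("mmcj", "MapAndReduce", ["data W", "transform"]),
    ("group", "MapAndReduce", ["mmcj"]),
    ("aggregate", "Reduce", ["group"]),
]

JOBS = piggyback(EXAMPLE)
```

On this input the sketch yields two jobs: the first holds data W, transform and mmcj, and the second holds group and aggregate, matching the assignment described in the text.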

5 Conclusion

In this report we have seen the requirements and the importance of research on the parallelization of ML algorithms, and the role of Linear Algebra in ML algorithms. The difficulty of parallelizing ML algorithms was illustrated by explaining the novel approach of the DSGD algorithm, an effort to parallelize SGD for large datasets on clusters. We also discussed SystemML, which provides users in different fields with an easier, declarative platform for executing ML algorithms.

Even though SystemML is concise and provides a user-friendly platform for executing limited forms of ML algorithms and some linear algebra primitives such as matrix multiplication, arithmetic operations and MF, DML does not support the more complex features of the object-oriented paradigm. It also does not support data structures such as arrays and lists, which are frequently used in ML algorithms; these are available in R, a language that provides a comprehensive set of flexible constructs for statistical and ML algorithms. On the other hand, Apache Mahout provides a broad set of Hadoop-based ML algorithm packages, but it still needs to be hand-tuned for different data sets and is more complex from the user's perspective.

References

[1] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[2] Rainer Gemulla, Erik Nijkamp, Peter J Haas, and Yannis Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 69–77. ACM, 2011.

[3] Amol Ghoting, Rajasekar Krishnamurthy, Edwin Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, and Shivakumar Vaithyanathan. SystemML: Declarative machine learning on MapReduce. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 231–242. IEEE, 2011.

[4] Thomas Hofmann, Jan Puzicha, and Michael I Jordan. Learning from dyadic data. Advances in neural information processing systems, pages 466–472, 1999.

[5] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.

[6] Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, and Yi-Min Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In Proceedings of the 19th international conference on World wide web, pages 681–690. ACM, 2010.

[7] Andrew Ng. CS229 lecture notes. CS229 Lecture notes, 1(1):1–3, 2000.

[8] Vijay Narayanan and Milind Bhandarkar. Modeling with Hadoop. Tutorial at KDD 2011.

[9] Charles Parker. Unexpected challenges in large scale machine learning. In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, pages 1–6. ACM, 2012.
