Yihua Huang, Ph.D., Professor
Email: yhuang@nju.edu.cn
NJU-PASA Lab for Big Data Processing, Department of Computer Science and Technology, Nanjing University


TRANSCRIPT

  • Slide 1
  • A Unified Programming Model and Platform for Big Data Machine Learning & Data Mining. Yihua Huang, Ph.D., Professor. Email: yhuang@nju.edu.cn. NJU-PASA Lab for Big Data Processing, Department of Computer Science and Technology, Nanjing University. May 29, 2015, India
  • Slide 2
  • PASA Big Data Lab at Nanjing University. Our lab studies parallel algorithms, systems, and applications for big data processing. We are the earliest big data lab in China, having entered the big data research area in 2009. We are now contributors to Apache Spark and Tachyon
  • Slide 3
  • Parallel Computing Models and Frameworks & Hadoop/Spark Performance Optimization: Hadoop job and resource scheduling optimization; Spark RDD persisting optimization. Big Data Storage and Query: Tachyon optimization; performance benchmarking tools for Tachyon and DFS; HBase secondary indexing (HBase + in-memory) and query system. Large-Scale Semantic Data Storage and Query: large-scale RDF semantic data storage and query system (HBase + in-memory); RDFS/OWL semantic reasoning engines on Hadoop and Spark. Machine Learning Algorithms and Systems for Big Data Analytics: parallel MLDM algorithm design on diversified parallel computing platforms; unified programming model and platform for MLDM algorithm design
  • Slide 4
  • Contents: Part 1. Parallel Algorithm Design for Machine Learning and Data Mining; Part 2. Unified Programming Model and Platform for Big Data Analytics
  • Slide 5
  • Part 1. Parallel Algorithm Design for Machine Learning and Data Mining
  • Slide 6
  • A variety of big data parallel computing platforms (Hadoop, Spark, MPI, etc.) are emerging. Serial machine learning algorithms cannot finish computation on large-scale datasets in acceptable time, yet they do not fit any of the existing parallel computing platforms directly and thus need to be rewritten in parallel for each platform. Our lab entered the big data area in 2009, starting from writing a variety of parallel machine learning algorithms on Hadoop, Spark, etc.
  • Slide 7
  • Frequent itemset mining (FIM) is one of the most important and most often used data mining algorithms. The Apriori algorithm is the most established algorithm for finding frequent itemsets from a transactional dataset. Tao Xiao, Shuai Wang, Chunfeng Yuan, Yihua Huang. PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets. The Fourth International Symposium on Parallel Architectures, Algorithms and Programming (PAAP 2011), pp. 252-257, 2011. Hongjian Qiu, Rong Gu, Chunfeng Yuan and Yihua Huang. YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark. The 3rd International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, in conjunction with IPDPS 2014, May 23, 2014, Phoenix, USA
  • Slide 8
  • Suppose I is an itemset consisting of items from the transaction database D. Let N be the number of transactions in D, and let M be the number of transactions that contain all the items of I. M/N is referred to as the support of I in D. Example: here N = 4; let I = {I1, I2}, then M = 2 because I = {I1, I2} is contained in transactions T100 and T400, so the support of I is 2/4 = 0.5 (a small sketch of this computation follows below). If sup(I) is no less than a user-defined threshold, then I is referred to as a frequent itemset. Goal of frequent itemset mining: find all frequent k-itemsets from a transaction database (k = 1, 2, 3, ...)
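    A tiny Scala sketch of this support computation; the transaction contents below are assumptions, chosen only so that, as in the example above, exactly T100 and T400 contain {I1, I2}:
      // support of an itemset I in a transaction database D (contents assumed for illustration)
      val D: Map[String, Set[String]] = Map(
        "T100" -> Set("I1", "I2", "I5"),
        "T200" -> Set("I2", "I4"),
        "T300" -> Set("I2", "I3"),
        "T400" -> Set("I1", "I2", "I4"))
      val I = Set("I1", "I2")
      val N = D.size                               // 4 transactions
      val M = D.values.count(t => I.subsetOf(t))   // 2 transactions contain all items of I
      val support = M.toDouble / N                 // 0.5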
  • Slide 9
  • Apriori algorithm: a classic frequent itemset mining algorithm. It needs multiple passes over the database. In the first pass, all frequent 1-itemsets are discovered. In each subsequent pass, frequent (k+1)-itemsets are discovered, using the frequent k-itemsets found in the previous pass as the seed for generating candidate itemsets. Repeat until no more frequent itemsets can be found (see the serial sketch below)
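    As a reference point for the parallel versions that follow, here is a minimal serial Scala sketch of the generate-and-count loop just described; the structure and names are illustrative, not taken from the slides:
      import scala.collection.mutable

      def apriori(transactions: Seq[Set[String]], minSupport: Double): Map[Set[String], Int] = {
        val minCount = math.ceil(minSupport * transactions.size).toInt
        // pass 1: frequent 1-itemsets
        var frequent: Map[Set[String], Int] = transactions
          .flatMap(t => t.map(item => Set(item)))
          .groupBy(identity).map { case (s, occ) => (s, occ.size) }
          .filter(_._2 >= minCount)
        var all = frequent
        var k = 1
        while (frequent.nonEmpty) {
          // generate (k+1)-candidates from the frequent k-itemsets (the seed)
          val seeds = frequent.keySet
          val candidates = for { a <- seeds; b <- seeds; c = a union b if c.size == k + 1 } yield c
          // count the candidates with one more pass over the database
          val counts = mutable.Map[Set[String], Int]().withDefaultValue(0)
          for (t <- transactions; c <- candidates if c.subsetOf(t)) counts(c) += 1
          frequent = counts.filter(_._2 >= minCount).toMap
          all ++= frequent
          k += 1
        }
        all
      }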
  • Slide 10
  • [1] Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499 Apriori Algorithm [1]:
  • Slide 11
  • The FIM process is both data-intensive and computing-intensive: transactional datasets are becoming larger and larger; iteratively trying all combinations from 1-itemsets to k-itemsets is time-consuming; and FIM needs to scan the dataset iteratively, many times
  • Slide 12
  • Apriori in MapReduce:
  • Slide 13
  • Experimental results: PSON achieves great speedup compared to the SON algorithm
  • Slide 14
  • The parallel Apriori algorithm with MapReduce needs to run MapReduce jobs iteratively. It needs to scan the dataset repeatedly and store all the intermediate data in HDFS. As a result, the parallel Apriori algorithm with MapReduce is not efficient enough
  • Slide 15
  • YAFIM, the Apriori algorithm implemented on the Spark model, gains about 18x speedup in our experiments. YAFIM contains two phases to find all frequent itemsets. Phase I: load the transaction dataset as a Spark RDD and generate the frequent 1-itemsets. Phase II: iteratively generate the frequent (k+1)-itemsets from the frequent k-itemsets
  • Slide 16
  • Load all transaction data into an RDD; all transaction data then reside in the RDD (a minimal loading sketch follows below)
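    A minimal Spark Scala sketch of this loading step and the Phase I counting; the input path, item separator, and support threshold are assumptions for illustration:
      import org.apache.spark.{SparkConf, SparkContext}

      val sc = new SparkContext(new SparkConf().setAppName("YAFIM-sketch"))
      val minCount = 100   // assumed absolute support threshold

      // load the transaction dataset into an RDD and keep it in memory for the later passes
      val transactions = sc.textFile("hdfs:///data/transactions.txt")
        .map(line => line.split(" ").toSet)
        .cache()

      // Phase I: count single items and keep those that meet the support threshold
      val frequent1 = transactions
        .flatMap(t => t.map(item => (item, 1)))
        .reduceByKey(_ + _)
        .filter(_._2 >= minCount)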
  • Slide 17
  • Phase I
  • Slide 18
  • Phase II
  • Slide 19
  • Methods to speed up performance. In-memory computing with RDDs: we make full use of RDDs and complete all computation in memory. Sharing data with broadcast: we adopt Spark's broadcast variable abstraction to reduce data transfer to tasks (a short sketch follows below)
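    A short Scala sketch of the broadcast idea; candidateKplus1, transactions, and minCount are assumed to come from the earlier steps:
      // broadcast the candidate (k+1)-itemsets to every worker once,
      // then count them in a single pass over the cached transaction RDD
      val candidates = sc.broadcast(candidateKplus1)   // e.g. a Set[Set[String]]

      val frequentKplus1 = transactions
        .flatMap(t => candidates.value.filter(c => c.subsetOf(t)).map(c => (c, 1)))
        .reduceByKey(_ + _)
        .filter(_._2 >= minCount)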
  • Slide 20
  • We ran experiments with both programs on four benchmarks [3] with different characteristics: MushRoom, T10I4D100K, Chess, and Pumsb_star, achieving about 18x speedup with Spark compared to the algorithm with MapReduce
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • We also apply YAFIM to a medical text semantic analysis application and achieve a 25x speedup
  • Slide 25
  • K-Means basic algorithm. Input: a dataset of N data points to be clustered into K clusters. Output: K clusters.
    Choose K cluster centers Centers[K] as the initial cluster centers
    Loop:
      for each data point P in the dataset:
        calculate the distance between P and each Centers[i]
        assign P to the nearest cluster center
      recalculate the new Centers[K]
    Repeat the loop until the cluster centers converge
  • Slide 26
  • Pseudo code for MapReduce (Mapper)
    class Mapper
      setup() {
        read k cluster centers Centers[K];
      }
      map(key, p)   // p is a data point
      {
        minDis = Double.MAX_VALUE; index = -1;
        for i = 0 to Centers.length - 1 {
          dis = ComputeDist(p, Centers[i]);
          if (dis < minDis) { minDis = dis; index = i; }
        }
        emit(Centers[index].ClusterID, (p, 1));   // emit once, after the loop, to the nearest center
      }
  • Slide 27
  • Pseudo code for MapReduce (Combiner). To optimize data I/O and network transfer, we can use a Combiner to reduce the number of key-value pairs emitted from a Map node
    class Combiner
      reduce(ClusterID, points = [(p1,1), (p2,1), ...])
      {
        pm = 0.0;
        n = points.length;
        for i = 0 to n - 1
          pm += points[i].p;
        pm = pm / n;   // average of the points assigned to this cluster on the map node
        emit(ClusterID, (pm, n));   // partial mean and count for the Reducer
      }
  • Slide 28
  • Pseudo code for MapReduce (Reducer)
    class Reducer
      reduce(ClusterID, valueList = [(pm1,n1), (pm2,n2), ...])
      {
        pm = 0.0; n = 0;
        k = valueList.length;   // number of partial results for this ClusterID
        for i = 0 to k - 1 {
          pm += pm[i] * n[i];
          n += n[i];
        }
        pm = pm / n;   // new center of the cluster (weighted average of the partial means)
        emit(ClusterID, (pm, n));   // output the new center of the cluster
      }
    In the main() function of the MapReduce job, run the job in a loop until the centers converge
  • Slide 29
  • Scala code
    while (tempDist > convergeDist && tempIter < MaxIter) {
      // determine the nearest center for each data point p
      var closest = data.map(p => (closestPoint(p, kPoints), (p, 1)))
      // sum the points and counts per cluster, then average them to get the new centers
      var pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }
      var newPoints = pointStats.map { pair => (pair._1, pair._2._1 / pair._2._2) }.collectAsMap()
      tempDist = 0.0
      for (i <- 0 until K) {
        tempDist += squaredDist(kPoints(i), newPoints(i))   // squaredDist: distance helper, like closestPoint above
      }
      for (newP <- newPoints) {
        kPoints(newP._1) = newP._2   // replace the old centers with the new ones
      }
      tempIter += 1
    }
  • Slide 30
  • [Chart: execution time (s) vs. number of nodes, for the 1st iteration and subsequent iterations] Spark achieves about 4-5x speedup compared to MapReduce. Peng Liu, Jiayu Teng, Yihua Huang. Study of k-means algorithm parallelization performance based on Spark. CCF Big Data 2014
  • Slide 31
  • Basic idea: given m classes from the training dataset, {C1, C2, ..., Cm}, predict which class a test sample X = (x1, x2, ..., xn) belongs to, i.e., choose the class Ci that maximizes P(Ci|X). By Bayes' theorem P(Ci|X) = P(X|Ci)P(Ci)/P(X), and P(X) is the same for all classes => we only need to calculate P(X|Ci)P(Ci). Suppose the attributes xk are independent of each other given the class => P(X|Ci) = P(x1|Ci) * P(x2|Ci) * ... * P(xn|Ci) => Thus, we can count from the training samples to get both P(xk|Ci) and P(Ci)
  • Slide 32
  • Training Map pseudo code to calculate P(xj|Ci) and P(Ci)
    class Mapper
      map(key, tr)   // tr is a training sample
      {
        tr -> trid, X, Ci
        emit(Ci, 1)
        for j = 0 to X.length - 1 {
          X[j] -> xnj, xvj   // xnj: name of xj, xvj: value of xj
          emit(<Ci, xnj, xvj>, 1)
        }
      }
  • Slide 33
  • Training Reduce pseudo code to calculate P(xj|Ci) and P(Ci)
    class Reducer
      reduce(key, value_list)   // key: either Ci or <Ci, xnj, xvj>
      {
        sum = 0;   // count for P(xj|Ci) or P(Ci)
        while (value_list.hasNext())
          sum += value_list.next().get();
        emit(key, sum)
      }
    // Turn the counts into the P(xj|Ci) and P(Ci) tables and save them in HDFS
    (A compact Spark sketch of this training step follows below.)
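    Purely as an illustration (not the lab's code), the Spark Scala sketch below computes the same P(Ci) and P(xj|Ci) tables from these counts; trainingData and its (label, attribute name/value) layout are assumptions:
      // trainingData: RDD[(String, Array[(String, String)])] = (class label Ci, attributes as (name, value) pairs)
      val classCounts = trainingData.map { case (ci, _) => (ci, 1L) }.reduceByKey(_ + _)
      val featureCounts = trainingData
        .flatMap { case (ci, attrs) => attrs.map { case (xn, xv) => ((ci, xn, xv), 1L) } }
        .reduceByKey(_ + _)

      val total = trainingData.count().toDouble
      val pCi = classCounts.mapValues(_ / total)                                   // P(Ci)
      val pXjCi = featureCounts
        .map { case ((ci, xn, xv), cnt) => (ci, ((xn, xv), cnt)) }
        .join(classCounts)
        .map { case (ci, (((xn, xv), cnt), ciCnt)) => ((ci, xn, xv), cnt.toDouble / ciCnt) }   // P(xj|Ci)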
  • Slide 34
  • Predict Map pseudo code to predict a test sample
    class Mapper
      setup() {
        load the P(xj|Ci) and P(Ci) data from the training stage:
        FC = { (Ci, P(Ci)) }, FxC = { (<Ci, xnj, xvj>, P(xj|Ci)) }
      }
      map(key, ts)   // ts is a test sample
      {
        ts -> tsid, X
        MaxF = MIN_VALUE; idx = -1;
        for i = 0 to FC.length - 1 {
          FXCi = 1.0
          Ci = FC[i].Ci; FCi = FC[i].P(Ci)
          for j = 0 to X.length - 1 {
            xnj = X[j].xnj; xvj = X[j].xvj
            look up <Ci, xnj, xvj> in FxC to get P(xj|Ci)
            FXCi = FXCi * P(xj|Ci)
          }
          if (FXCi * FCi > MaxF) { MaxF = FXCi * FCi; idx = i; }
        }
        emit(tsid, FC[idx].Ci)
      }
  • Slide 35
  • Training SparkR code to calculate P(xj|Ci) and P(Ci). For large-scale matrix multiplication, how to partition the matrices is critical for computation performance => We developed an automatic matrix partitioning and optimized execution algorithm that chooses among HAMA blocking, CARMA blocking, and broadcasting according to the shapes and sizes of the matrices, and then schedules them for execution in parallel (an illustrative block-multiplication sketch follows below)
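    Not Marlin's own implementation: purely as an illustration of block-partitioned distributed matrix multiplication on Spark, the sketch below uses Spark MLlib's built-in BlockMatrix; the file paths, input format, and block sizes are assumptions.
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

      val sc = new SparkContext(new SparkConf().setAppName("block-multiply-sketch"))

      // each input line is assumed to be "rowIndex colIndex value"
      def loadCoordinateMatrix(path: String): CoordinateMatrix = {
        val entries = sc.textFile(path).map { line =>
          val Array(i, j, v) = line.split(" ")
          MatrixEntry(i.toLong, j.toLong, v.toDouble)
        }
        new CoordinateMatrix(entries)
      }

      // partition both matrices into 1024 x 1024 blocks and multiply them;
      // each block-by-block product runs as a task on the cluster
      val a = loadCoordinateMatrix("hdfs:///data/matrix-a.txt").toBlockMatrix(1024, 1024).cache()
      val b = loadCoordinateMatrix("hdfs:///data/matrix-b.txt").toBlockMatrix(1024, 1024).cache()
      val c = a.multiply(b)
      println(s"result size: ${c.numRows()} x ${c.numCols()}")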
  • Slide 64
  • Marlin: Optimized Distributed Matrix Multiplication with Spark. OctMatrix: Distributed Matrix Computation Lib. [Charts: multiplying a big matrix by a small matrix; multiplying two big matrices]
  • Slide 65
  • Marlin: Optimized Distributed Matrix Multiplication with Spark OctMatrix: Distributed Matrix Computation Lib
  • Slide 66
  • Marlin: Optimized Distributed Matrix Multiplication with Spark OctMatrix: Distributed Matrix Computation Lib 4~5x Speedup Compared to SparkR
  • Slide 67
  • Marlin: Optimized Distributed Matrix Multiplication with Spark. OctMatrix: Distributed Matrix Computation Lib. Matrix multiply, 96 partitions, executor memory 10GB, except case 3_5 which uses 20GB
  • Slide 68
  • OctMatrix data representation and storage
    \Octopus_HOME
      \user-session-id1\
        \matrix-a
          info
          row_index
          \row-data
            par1.data ... parN.data
          col_index
          \col-data
            par1.data ... parN.data
        \matrix-b
        \matrix-c
      \user-session-id2\
      \user-session-id3\
    > Matrix data can be stored in local files, HDFS, and Tachyon, and R programs can read from and write to these file systems
    > Matrix data is organized and stored according to this defined structure
  • Slide 69
  • Machine learning library built with OctMatrix. Classification and regression: Linear Regression, Logistic Regression, Softmax, Linear Support Vector Machine (SVM). Clustering: K-Means. Feature extraction: Deep Neural Network (Auto Encoder). More MLDM algorithms to come
  • Slide 70
  • How Octopus Works > Uses the standard R programming platform, allowing users to write and implement code for a variety of MLDM algorithms based on a large-scale matrix computation model > Octopus has been integrated with Spark, Hadoop MapReduce, and MPI, allowing seamless switching and execution on top of the underlying platforms [Architecture: Octopus on top of Spark, Hadoop MapReduce, MPI, or a single machine]
  • Slide 71
  • Octopus Features Summary. Easy-to-use, high-level user APIs: high-level matrix operators and operation APIs, similar to the Matrix/Vector operation APIs of the standard R language; they do not require low-level distributed-system knowledge or programming skills. Write Once, Run Anywhere: programs written with Octopus can transparently run on top of different computing engines such as Spark, Hadoop MapReduce, or MPI; using the OctMatrix APIs, a program can be tested with small data on a single-machine R engine and then run on large-scale data without modifying the code; a number of I/O sources are supported, including Tachyon, HDFS, and local file systems
  • Slide 72
  • Octopus Features Summary. Distributed R apply Functions: Octopus offers the apply() function on OctMatrix; the parameter function is executed on each element/row/column of the OctMatrix on the cluster in parallel; parameter functions passed to apply() can be any R function, including UDFs. Machine Learning Algorithm Library: a set of scalable machine learning algorithms and demo applications implemented on top of OctMatrix. Seamless Integration with the R Ecosystem: Octopus offers its features in an R package called OctMatrix and naturally takes advantage of the rich resources of the R ecosystem
  • Slide 73
  • Demonstrations Read/Write Octopus Matrix
  • Slide 74
  • Demonstrations A Variety of R Functions on Octopus
  • Slide 75
  • Demonstrations Logistic Regression: training, predicting, and testing. Changing the engine type quickly switches execution to one of the underlying platforms without the need to modify any other code
  • Slide 76
  • Demonstrations K-Means Algorithm Testing
  • Slide 77
  • Demonstrations Linear Regression Algorithm Testing
  • Slide 78
  • Demonstrations Code Style Comparison between R and Octopus LR Codes with Standard R LR Codes with Octopus
  • Slide 79
  • Demonstrations Code Style Comparison between R and Octopus K-Means Codes with Standard R K-Means Codes with Octopus
  • Slide 80
  • Demonstrations Algorithms with MPI and Hadoop MapReduce: linear algebra running with MPI. Start an MPI daemon to run MPI-Matrix in the background
  • Slide 81
  • Demonstrations Algorithms with MPI and Hadoop MapReduce: linear algebra running with Hadoop MapReduce
  • Slide 82
  • Octopus Project Website and Documents http://pasa-bigdata.nju.edu.cn/octopus/
  • Slide 83
  • Project Team: Yihua Huang, Rong Gu, Zhaokang Wang, Yun Tang, Haipeng Zhan. Contact Information: Dr. Yihua Huang, Professor, NJU-PASA Big Data Lab, http://pasa-bigdata.nju.edu.cn, Department of Computer Science and Technology, Nanjing University, Nanjing, P.R. China. Email: yhuang@nju.edu.cn