MAD SKILLS FOR ANALYSIS AND BIG DATA MACHINE LEARNING University of Helsinki Gianvito Siciliano (2014 - Distributed Computing Frameworks for Big Data Seminar)

DESCRIPTION

A brief introduction and comparison of big data approaches, frameworks and machine learning algorithms

TRANSCRIPT

Page 1: MAD skills for analysis and big data Machine Learning

MAD SKILLS FOR ANALYSIS AND

BIG DATA MACHINE LEARNING

University of Helsinki

Gianvito Siciliano

(2014 - Distributed Computing Frameworks for Big Data Seminar)

Page 2: MAD skills for analysis and big data Machine Learning

COMPARISON OF

• APPROACHES

• PLATFORMS

• ALGORITHMS

Page 3: MAD skills for analysis and big data Machine Learning

AGENDA

1. Analysis intro:

• needed skills (MAD)

• important areas (IS, ML)

2. Big Data intensive approaches:

• HPC, ABDS, BDAS

3. Machine Learning tool generations

• SAS, Weka, Hadoop, Mahout, HaLoop, Spark (…)

4. Large scale (ML) algorithms comparison

• K-means, LogReg

Page 4: MAD skills for analysis and big data Machine Learning

“So, what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis.”

The value of data analysis has entered common culture: it lets you uncover the unexpected in your data.

Why data analysis?

Page 5: MAD skills for analysis and big data Machine Learning

The MAD acronym is made up of three inherent aspects of big data analysis. Magnetic: attracting data from heterogeneous sources, regardless of the quality of the data.

How to make sense of data?

Page 6: MAD skills for analysis and big data Machine Learning

The MAD acronym is made up of three inherent aspects of big data analysis. Agile: performing analysis quickly, to obtain actions that maximize the value for the business.

How to make sense of data?

Page 7: MAD skills for analysis and big data Machine Learning

The MAD acronym is made up of three inherent aspects of big data analysis. Deep: enabling analysts to master both sophisticated statistical methods and the best-performing ML algorithms, to study enormous datasets in distributed environments.

How to make sense of data?

Page 8: MAD skills for analysis and big data Machine Learning

• Inferential statistics, which allows you to capture the underlying properties of the population (prediction, causality analysis and distributional comparison)

• Machine Learning, “…is the unsung hero that powers many of the most sophisticated big data analytic applications”.

How to go deep?

Page 9: MAD skills for analysis and big data Machine Learning

MAD skills, 2 key points

• DB design: capture, modelling, manage, querying… (SQL)

• Programming style: extract, transform, process, investigate… (MapReduce)

Page 10: MAD skills for analysis and big data Machine Learning

Parallel DBMSs are substantially faster than MR systems once the data is loaded, but loading the data takes considerably longer in the DB system.

MapReduce has captured the interest of many developers because of its simple two-function paradigm, and it is widely viewed as a more attractive programming environment than SQL.

The MR paradigm simplifies the schema-writing process for data: it just requires loading and copying data into the storage system.
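The two-function paradigm is easy to sketch. The following is a minimal in-memory toy in plain Python (illustrative only, not Hadoop's actual API), using the classic word-count example:

```python
from collections import defaultdict

def map_fn(document):
    # Map: emit a (word, 1) pair for every word in the document
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce: sum all partial counts for one word
    return (word, sum(counts))

def mapreduce(documents):
    # Shuffle: group the intermediate pairs by key before reducing
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

docs = ["big data needs analysis", "big data needs skills"]
print(mapreduce(docs))
# {'big': 2, 'data': 2, 'needs': 2, 'analysis': 1, 'skills': 1}
```

In a real MapReduce system the user writes only the two functions; the shuffle step in the middle is what the framework distributes and fault-tolerates across the cluster.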

MAD design for smart environment!

Page 11: MAD skills for analysis and big data Machine Learning

As each approach has its own set of pros and cons, the proposal can be a database-Hadoop hybrid approach to scalable machine learning, where batch learning is performed on the Hadoop platform and data are stored (and organised) with the help of parallel DBMSs.

The critical skill for a MAD analyst becomes interoperability across complex pipelines that include some stages in SQL and some in MapReduce syntax.

MAD design for smart environment!

Page 12: MAD skills for analysis and big data Machine Learning

• parallelizing and distributing data analysis

• large-scale data sets

• cluster and data fault tolerance

• iterative processing

How to deal with Big Data and Machine Learning?

Page 13: MAD skills for analysis and big data Machine Learning

BIG DATA INTENSIVE PARADIGMS

High Performance Computing is the use of parallel processing to run advanced application programs efficiently, reliably and quickly.

• parallel processing (MPI)

• advanced, high-performance applications (Molecular Dynamics)

• separate cluster (VMs), compute (SLURM) and storage (LUSTRE) layers

[Diagram: the supercomputing/HPC stack layers: app, proc, comm, strg]

Page 14: MAD skills for analysis and big data Machine Learning

BIG DATA INTENSIVE PARADIGMS

Apache Big Data Stack Based on the integration of compute and data, it introduces application-level scheduling to facilitate heterogeneous application workloads and high cluster utilization.

• MapReduce paradigm

• integration of compute/data management

• cheap hardware

• low communication needs among cluster nodes

• many open-source implementations, support and docs

• tight coupling between storage (HDFS) and resource management (YARN)

• no shared memory

• no support for iteration

[Diagram: the ABDS stack layers: app, proc, comm, strg]

Page 15: MAD skills for analysis and big data Machine Learning

BIG DATA INTENSIVE PARADIGMS

Berkeley Data Analytics Stack It emerged in response to new application requirements (short-running tasks) and to overcome the problems of its predecessor (lack of data caching).

• Transform and Act paradigm

• multi-level scheduler (MESOS)

• runtime iterative processing (SPARK)

• distributed shared memory (RDD)

• …young?

[Diagram: the BDAS stack layers: app, proc, comm, strg]

Page 16: MAD skills for analysis and big data Machine Learning

FROM 2 PARADIGMS TO A HYBRID TOOL

HPC - data-intensive parallel task workflows

+

ABDS - compute-demanding jobs on clusters and MapReduce-style batch processing

=

BDAS - provides caching and shared memory

ML - remember that the algorithms need iterative processing!

=> SPARK - a distributed framework for (big) data preparation and machine learning, based on a resilient (cache) system to recompute iterations
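The caching idea behind RDDs can be sketched in a few lines. The class below is a toy, single-process model (the name MiniRDD and its API are illustrative, not Spark's actual implementation): transformations only extend a lazy lineage, and cache() keeps the materialised result in memory so that iterative algorithms do not recompute it.

```python
class MiniRDD:
    """Toy model of a resilient distributed dataset: a lazy
    transformation chain (the lineage) whose result can be
    recomputed on demand or served from an in-memory cache."""

    def __init__(self, compute):
        self._compute = compute   # lineage: closure that rebuilds the data
        self._cached = None       # materialised result, once cached
        self._cache_on = False

    @staticmethod
    def parallelize(data):
        # Entry point: wrap an in-memory collection
        return MiniRDD(lambda: list(data))

    def map(self, f):
        # Transformations are lazy: they only extend the lineage
        return MiniRDD(lambda: [f(x) for x in self.collect()])

    def cache(self):
        # Mark this dataset to be kept in memory after first evaluation
        self._cache_on = True
        return self

    def collect(self):
        # Action: evaluate the lineage, or reuse the cached result
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._cache_on:
            self._cached = result
        return result

# An "expensive" transformation whose invocations we count
calls = {"n": 0}
def square(x):
    calls["n"] += 1
    return x * x

data = MiniRDD.parallelize([1, 2, 3]).map(square).cache()
for _ in range(3):        # an iterative algorithm touching the data 3 times
    data.collect()
print(calls["n"])         # 3, not 9: the squares were computed once
```

In Spark the same lineage is what makes the dataset resilient: a lost partition is rebuilt by replaying its transformation chain instead of being replicated up front.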

Page 17: MAD skills for analysis and big data Machine Learning

BIG DATA FRAMEWORK SPACE

[Figure: the big data framework space, plotting Age/Maturity for Fast Data, Big Analytics and Big Application frameworks]

Page 18: MAD skills for analysis and big data Machine Learning

THREE GENERATIONS OF ML TOOLS

First generation: traditional ML tools (SAS, SPSS, Weka, R)

• wide set of ML algorithms, can facilitate deep analysis

• vertically scalable, non-distributed, smaller data sets

Second generation: ML tools built over Hadoop (Mahout, Pentaho, RapidMiner)

• scale to large data sets, distributed

• no database connectivity (ODBC)

• smaller sub-set of algorithms

• low performance with multi-stage applications (e.g. machine learning and graph processing)

• inefficient primitives for data sharing

• poor support for ad-hoc and interactive queries

• slow iterative computations

Third generation: new purpose-built tools (HaLoop, Twister, Pregel, GraphLab, Spark)

• modularity

• shared memory

• iterative ML algorithms

• asynchronous graph processing

• memory cached across iterations/interactions

Page 19: MAD skills for analysis and big data Machine Learning

ML ALGORITHMS

K-means, for clustering analysis. The iteration time of k-means is dominated by the compute-intensive task of calculating the centroids from a set of data points.

Logistic Regression, a type of probabilistic statistical classification model. For the comparison it is used for a binary classification task: it is less compute-intensive than k-means and more sensitive to time spent in deserialization and I/O.
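As a minimal sketch of what each algorithm does per iteration (a NumPy illustration, not the implementations that were benchmarked):

```python
import numpy as np

def kmeans_step(points, centroids):
    # Assignment: distance from every point to every centroid
    # (the compute-intensive part of each k-means iteration)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update: each centroid becomes the mean of its assigned points
    new_centroids = np.array([
        points[labels == k].mean(axis=0) if (labels == k).any() else centroids[k]
        for k in range(len(centroids))
    ])
    return new_centroids, labels

def logreg_step(X, y, w, lr=0.1):
    # One batch gradient-descent update for binary logistic regression:
    # cheap arithmetic per point, so I/O and deserialization dominate
    p = 1.0 / (1.0 + np.exp(-X @ w))          # predicted probabilities
    return w - lr * X.T @ (p - y) / len(y)

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
cents, labels = kmeans_step(pts, np.array([[0.0, 0.0], [5.0, 5.0]]))
print(labels)   # [0 0 1 1]: two well-separated clusters
```

Both algorithms repeat their step until convergence, which is exactly why caching the input dataset across iterations matters so much in the benchmarks.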

Page 20: MAD skills for analysis and big data Machine Learning

K-MEANS

Page 21: MAD skills for analysis and big data Machine Learning

[Figures: K-means and Logistic Regression running times, panels a)–f): Time (s) against the number of iterations, the number of machines and the input size]

LOG REG

Page 22: MAD skills for analysis and big data Machine Learning

CONCLUSIONS

• MAD design can help the analysis process, much as the AGILE methodology helps the software development process.

• The better performance of parallel DBMSs should be complementary to MapReduce systems.

• MapReduce provides powerful abstractions for data processing, analytics and machine learning to the end-user, which evolve naturally into the new “transform and act” paradigm used in Spark.

• Spark takes the best techniques from both ABDS and HPC. It is the core of BDAS and the best framework in this scenario.

• Resilient distributed datasets (RDDs) are an efficient, general-purpose and fault-tolerant abstraction for sharing data in cluster applications, and they are the added value of Spark.

• Frameworks like Twister and HaLoop are good candidates as alternatives to Spark, but they do not appear to be mature enough.

Page 23: MAD skills for analysis and big data Machine Learning

Acknowledgements

Reviewers: Dr. Sasu Tarkoma, Dr. Mohammad Hoque