A brief introduction and comparison of big data approaches, frameworks and machine learning algorithms.
MAD SKILLS FOR ANALYSIS AND
BIG DATA MACHINE LEARNING
University of Helsinki
Gianvito Siciliano
(2014 - Distributed Computing Frameworks for Big Data Seminar)
COMPARISON OF
• APPROACHES
• PLATFORMS
• ALGORITHMS
AGENDA
1. Analysis intro:
• needed skills (MAD)
• important areas (IS, ML)
2. Big Data intensive approaches:
• HPC, ABDS, BDAS
3. Machine Learning tool generations
• SAS, Weka, Hadoop, Mahout, HaLoop, Spark (…)
4. Large scale (ML) algorithms comparison
• K-means, LogReg
“So, what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis.”
Why data analysis?
The value of data analysis has entered common culture: it helps uncover the unexpected in your data.
How to make sense of data?
The MAD acronym captures three inherent aspects of big data analysis:
• Magnetic: attracting data from heterogeneous sources, regardless of the quality of the data.
• Agile: performing analysis fast, so that the results drive actions that maximize value for the business.
• Deep: enabling analysts to apply both sophisticated statistical methods and the best-performing ML algorithms to study enormous datasets in distributed environments.
How to go deep?
• Inferential statistics, which lets you capture the underlying properties of the population (prediction, causality analysis and distributional comparison).
• Machine Learning, “…the unsung hero that powers many of the most sophisticated big data analytic applications”.
MAD skills, 2 key points
• DB design: capture, model, manage, query… (SQL)
• Programming style: extract, transform, process, investigate… (MapReduce)
Parallel DBMSs are substantially faster than MR systems once the data is loaded, but loading the data takes considerably longer in the DB system.
MapReduce has captured the interest of many developers because of its simple two-function paradigm, and it is widely viewed as a more attractive programming environment than SQL.
The MR paradigm simplifies the schema-writing process: it only requires loading and copying data into the storage system.
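As a concrete illustration, the two-function paradigm can be sketched in plain Python (the driver and the function names here are illustrative, not Hadoop's actual API):

```python
from collections import defaultdict

def map_fn(document):
    # Map: emit a (word, 1) pair for every word in the document
    return [(word, 1) for word in document.split()]

def reduce_fn(key, values):
    # Reduce: combine all counts emitted for one key
    return (key, sum(values))

def mapreduce(documents):
    # Shuffle: group mapper output by key, then reduce each group
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = mapreduce(["big data analysis", "big data machine learning"])
# → counts["big"] == 2, counts["analysis"] == 1
```

The user only writes the two functions; partitioning, grouping and fault tolerance are the framework's job.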
MAD design for smart environments!
Since each approach has its own set of pros and cons, the proposal is a database–Hadoop hybrid approach to scalable machine learning, where batch learning is performed on the Hadoop platform and data are stored (and organised) with the help of parallel DBMSs.
The critical skill for a MAD analyst becomes interoperability across complex pipelines that include some stages in SQL and some in MapReduce syntax.
How to deal with Big Data and Machine Learning?
• parallelizing and distributing data analysis
• large-scale data sets
• cluster and data fault tolerance
• iterative processing
BIG DATA INTENSIVE PARADIGMS
High Performance Computing is the use of parallel processing to run advanced application programs efficiently, reliably and quickly.
• parallel processing (MPI)
• advanced, high-performance applications (e.g. Molecular Dynamics)
• separate cluster (VMs), compute (SLURM) and storage (LUSTRE) layers
• supercomputing
[HPC stack diagram: application / processing / communication / storage layers]
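The scatter/compute/gather pattern that HPC codes follow can be sketched in Python. This is a hedged, single-machine analogue: real MPI programs distribute the chunks across ranks with scatter/gather calls, while here a thread pool stands in for the workers.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each "rank" computes its local result independently
    return sum(x * x for x in chunk)

data = list(range(100_000))
chunks = [data[i::4] for i in range(4)]             # scatter: partition the data
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, chunks))  # parallel compute
total = sum(partials)                               # gather + reduce
# → total equals the serial sum of squares
```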
BIG DATA INTENSIVE PARADIGMS
Apache Big Data Stack. Based on the integration of compute and data, it introduces application-level scheduling to facilitate heterogeneous application workloads and high cluster utilization.
• MapReduce paradigm
• integrated compute/data management
• cheap hardware
• low communication needs among clusters
• many open-source implementations, support and docs
• tight coupling between storage (HDFS) and resource management (YARN)
• no shared memory
• no support for iteration
[ABDS stack diagram: application / processing / communication / storage layers]
BIG DATA INTENSIVE PARADIGMS
Berkeley Data Analytics Stack. It emerged in response to new application requirements (short-running tasks) and to overcome the problems of its predecessor (no data caching).
• Transform-and-Act paradigm
• multi-level scheduler (MESOS)
• runtime iterative processing (SPARK)
• distributed shared memory (RDD)
• …young?
[BDAS stack diagram: application / processing / communication / storage layers]
FROM 2 PARADIGMS TO A HYBRID TOOL
HPC: data-(intensive-)parallel task workflows
+
ABDS: compute-demanding jobs on clusters and MapReduce-style batch processing
=
BDAS: provides caching and shared memory
…
ML: remember that the algorithms need iterative processing!
=> SPARK: a distributed framework for (big) data preparation and machine learning, based on a resilient (cache) system that can recompute iterations
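Why caching matters for iterative ML can be sketched in plain Python. This is a toy stand-in, not the Spark API: `load_and_parse` plays the role of reading from distributed storage, and the cached list plays the role of an RDD kept in memory across iterations.

```python
import time

def load_and_parse():
    # Stand-in for an expensive load/parse from distributed storage
    return [float(x) for x in range(200_000)]

def gradient_step(data, w):
    # One toy iteration over the full dataset
    return w - 0.1 * sum(x * w for x in data) / len(data)

# Hadoop-style: re-materialize the input on every iteration
w_reload, start = 1.0, time.perf_counter()
for _ in range(10):
    w_reload = gradient_step(load_and_parse(), w_reload)
reload_time = time.perf_counter() - start

# Spark-style: parse once, keep the dataset cached across iterations
cached = load_and_parse()
w_cached, start = 1.0, time.perf_counter()
for _ in range(10):
    w_cached = gradient_step(cached, w_cached)
cached_time = time.perf_counter() - start
# Both loops compute the same result; the cached one skips 9 reloads
```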
BIG DATA FRAMEWORK SPACE
[Figure: frameworks plotted by Age/Maturity against Fast Data, Big Analytics and Big Applications]
THREE GENERATIONS OF ML TOOLS
First generation: traditional ML tools (SAS, SPSS, Weka, R).
• wide set of ML algorithms, can facilitate deep analysis
• vertically scalable, non-distributed, smaller data sets
Second generation: ML tools built over Hadoop (Mahout, Pentaho, RapidMiner).
• scale to large data sets, distributed
• no database connectivity (ODBC)
• smaller sub-set of algorithms
• low performance with multi-stage applications (e.g. machine learning and graph processing)
• inefficient primitives for data sharing
• poor support for ad-hoc and interactive queries
• slow iterative computations
Third generation: new purpose-built tools (HaLoop, Twister, Pregel, GraphLab, Spark).
• modularity, shared memory
• iterative ML algorithms
• asynchronous graph processing
• memory cached across iterations/interactions
ML ALGORITHMS
K-means, for clustering analysis. The iteration time of k-means is dominated by the compute-intensive task of calculating the centroids from a set of data points.
Logistic Regression, a probabilistic statistical classification model. In this comparison it is used for a binary classification task: it is less compute-intensive than k-means and more sensitive to time spent in deserialization and I/O.
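A minimal, single-machine sketch of the algorithm (Lloyd's k-means on 1-D points, with the simplifying assumption that the first k points seed the centroids) shows the assignment step and the compute-intensive centroid-update step:

```python
def kmeans(points, k, iterations=10):
    # Seed centroids with the first k points (a simplification;
    # real implementations use random or k-means++ initialization)
    centroids = points[:k]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step (the compute-intensive part): recompute each
        # centroid as the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids = sorted(kmeans(points, k=2))
# → centroids ≈ [1.0, 9.53]
```

In a distributed setting each worker computes partial sums over its partition of the points; only the small per-cluster (sum, count) pairs are exchanged per iteration.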
K-MEANS
[Figures: running times (s) across iterations, number of machines and input size; panels a)–f)]
LOG REG
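A minimal sketch of the classifier on a binary task (plain batch gradient descent on 1-D toy data; the data, names and hyperparameters are illustrative, not those used in the benchmark):

```python
import math

def train_logreg(xs, ys, lr=0.5, epochs=200):
    # Batch gradient descent on 1-D inputs with a bias term
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # sigmoid
            grad_w += (p - y) * x
            grad_b += (p - y)
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return w, b

def predict(w, b, x):
    # Classify by thresholding the predicted probability at 0.5
    return 1 if 1.0 / (1.0 + math.exp(-(w * x + b))) >= 0.5 else 0

xs = [0.5, 1.0, 1.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logreg(xs, ys)
# the trained model separates the two groups of points
```

Each epoch is a cheap linear pass over the data, which is why the per-iteration cost is dominated by I/O and deserialization rather than computation.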
CONCLUSIONS
• MAD design can help the analysis process, much as the AGILE methodology helps the software development process.
• The better performance of parallel DBMSs should be seen as complementary to MapReduce systems.
• MapReduce provides powerful abstractions for data processing, analytics and machine learning to the end user, which naturally evolve into the new ”transform and act” paradigm used in Spark.
• Spark takes the best techniques from both ABDS and HPC. It is the core of BDAS and the best framework in this scenario.
• Resilient distributed datasets (RDDs) are an efficient, general-purpose and fault-tolerant abstraction for sharing data in cluster applications, and they are the added value of Spark.
• Frameworks like Twister and HaLoop are good candidates as alternatives to Spark, but they do not appear to be mature enough.
Acknowledgements
Reviewers: Dr. Sasu Tarkoma, Dr. Mohammad Hoque