A brief introduction and comparison of big data approaches, frameworks and machine learning algorithms.
MAD SKILLS FOR ANALYSIS AND
BIG DATA MACHINE LEARNING
University of Helsinki
Gianvito Siciliano
(2014 - Distributed Computing Frameworks for Big Data Seminar)
COMPARISON OF
• APPROACHES
• PLATFORMS
• ALGORITHMS
AGENDA
1. Analysis intro:
• needed skills (MAD)
• important areas (IS, ML)
2. Big Data intensive approaches:
• HPC, ABDS, BDAS
3. Machine Learning tool generations
• SAS, Weka, Hadoop, Mahout, HaLoop, Spark (…)
4. Large scale (ML) algorithms comparison
• K-means, LogReg
“So, what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis.”
Why data analysis?
The value of data analysis has entered common culture: it helps uncover the unexpected in your data.
How to make sense of data?
The MAD acronym captures three inherent aspects of big data analysis:
• Magnetic: attracting data from heterogeneous sources, regardless of the quality of the data.
• Agile: performing analysis fast, so that the results drive actions that maximize value for the business.
• Deep: enabling analysts to apply both sophisticated statistical methods and the best-performing ML algorithms to study enormous datasets in distributed environments.
How to go deep?
• Inferential statistics, which lets you capture the underlying properties of the population (prediction, causality analysis and distributional comparison).
• Machine Learning, “…the unsung hero that powers many of the most sophisticated big data analytic applications”.
MAD skills, 2 key points
• DB design: capture, model, manage, query… (SQL)
• Programming style: extract, transform, process, investigate… (MapReduce)
Parallel DBMSs are substantially faster than MR systems once the data is loaded, but loading the data takes considerably longer in the DB system.
MapReduce has captured the interest of many developers because of its simple two-function paradigm, and it is widely viewed as a more attractive programming environment than SQL.
The MR paradigm simplifies the schema-writing process: it only requires loading and copying data into the storage system.
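As a concrete illustration, the two-function paradigm can be sketched in plain Python (the driver and the function names here are illustrative, not Hadoop's actual API):

```python
from collections import defaultdict

def map_fn(document):
    # Map: emit a (word, 1) pair for every word in the document
    return [(word, 1) for word in document.split()]

def reduce_fn(key, values):
    # Reduce: combine all counts emitted for one key
    return (key, sum(values))

def mapreduce(documents):
    # Shuffle: group mapper output by key, then reduce each group
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = mapreduce(["big data analysis", "big data machine learning"])
# → counts["big"] == 2, counts["analysis"] == 1
```

The user only writes the two functions; partitioning, grouping and fault tolerance are the framework's job.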
MAD design for smart environments!
Since each approach has its own set of pros and cons, the proposal is a database–Hadoop hybrid approach to scalable machine learning, where batch learning is performed on the Hadoop platform and data are stored (and organised) with the help of parallel DBMSs.
The critical skill for a MAD analyst becomes interoperability across complex pipelines that include some stages in SQL and some in MapReduce syntax.
How to deal with Big Data and Machine Learning?
• parallelizing and distributing data analysis
• large-scale data sets
• cluster and data fault tolerance
• iterative processing
BIG DATA INTENSIVE PARADIGMS
High Performance Computing is the use of parallel processing to run advanced application programs efficiently, reliably and quickly.
• parallel processing (MPI)
• advanced, high-performance applications (e.g. Molecular Dynamics)
• separate cluster (VMs), compute (SLURM) and storage (LUSTRE) layers
• supercomputing
[HPC stack diagram: application / processing / communication / storage layers]
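The scatter/compute/gather pattern that HPC codes follow can be sketched in Python. This is a hedged, single-machine analogue: real MPI programs distribute the chunks across ranks with scatter/gather calls, while here a thread pool stands in for the workers.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each "rank" computes its local result independently
    return sum(x * x for x in chunk)

data = list(range(100_000))
chunks = [data[i::4] for i in range(4)]             # scatter: partition the data
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, chunks))  # parallel compute
total = sum(partials)                               # gather + reduce
# → total equals the serial sum of squares
```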
BIG DATA INTENSIVE PARADIGMS
Apache Big Data Stack. Based on the integration of compute and data, it introduces application-level scheduling to facilitate heterogeneous application workloads and high cluster utilization.
• MapReduce paradigm
• integrated compute/data management
• cheap hardware
• low communication needs among clusters
• many open-source implementations, support and docs
• tight coupling between storage (HDFS) and resource management (YARN)
• no shared memory
• no support for iteration
[ABDS stack diagram: application / processing / communication / storage layers]
BIG DATA INTENSIVE PARADIGMS
Berkeley Data Analytics Stack. It emerged in response to new application requirements (short-running tasks) and to overcome the problems of its predecessor (no data caching).
• Transform-and-Act paradigm
• multi-level scheduler (MESOS)
• runtime iterative processing (SPARK)
• distributed shared memory (RDD)
• …young?
[BDAS stack diagram: application / processing / communication / storage layers]
FROM 2 PARADIGMS TO A HYBRID TOOL
HPC: data-(intensive-)parallel task workflows
+
ABDS: compute-demanding jobs on clusters and MapReduce-style batch processing
=
BDAS: provides caching and shared memory
…
ML: remember that the algorithms need iterative processing!
=> SPARK: a distributed framework for (big) data preparation and machine learning, based on a resilient (cache) system that can recompute iterations
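Why caching matters for iterative ML can be sketched in plain Python. This is a toy stand-in, not the Spark API: `load_and_parse` plays the role of reading from distributed storage, and the cached list plays the role of an RDD kept in memory across iterations.

```python
import time

def load_and_parse():
    # Stand-in for an expensive load/parse from distributed storage
    return [float(x) for x in range(200_000)]

def gradient_step(data, w):
    # One toy iteration over the full dataset
    return w - 0.1 * sum(x * w for x in data) / len(data)

# Hadoop-style: re-materialize the input on every iteration
w_reload, start = 1.0, time.perf_counter()
for _ in range(10):
    w_reload = gradient_step(load_and_parse(), w_reload)
reload_time = time.perf_counter() - start

# Spark-style: parse once, keep the dataset cached across iterations
cached = load_and_parse()
w_cached, start = 1.0, time.perf_counter()
for _ in range(10):
    w_cached = gradient_step(cached, w_cached)
cached_time = time.perf_counter() - start
# Both loops compute the same result; the cached one skips 9 reloads
```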
BIG DATA FRAMEWORK SPACE
[Figure: frameworks plotted by Age/Maturity against Fast Data, Big Analytics and Big Applications]
THREE GENERATIONS OF ML TOOLS
First generation: traditional ML tools (SAS, SPSS, Weka, R).
• wide set of ML algorithms, can facilitate deep analysis
• vertically scalable, non-distributed, smaller data sets
Second generation: ML tools built over Hadoop (Mahout, Pentaho, RapidMiner).
• scale to large data sets, distributed
• no database connectivity (ODBC)
• smaller sub-set of algorithms
• low performance with multi-stage applications (e.g. machine learning and graph processing)
• inefficient primitives for data sharing
• poor support for ad-hoc and interactive queries
• slow iterative computations
Third generation: new purpose-built tools (HaLoop, Twister, Pregel, GraphLab, Spark).
• modularity, shared memory
• iterative ML algorithms
• asynchronous graph processing
• memory cached across iterations/interactions
ML ALGORITHMS
K-means, for clustering analysis. The iteration time of k-means is dominated by the compute-intensive task of calculating the centroids from a set of data points.
Logistic Regression, a probabilistic statistical classification model. In this comparison it is used for a binary classification task: it is less compute-intensive than k-means and more sensitive to time spent in deserialization and I/O.
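A minimal, single-machine sketch of the algorithm (Lloyd's k-means on 1-D points, with the simplifying assumption that the first k points seed the centroids) shows the assignment step and the compute-intensive centroid-update step:

```python
def kmeans(points, k, iterations=10):
    # Seed centroids with the first k points (a simplification;
    # real implementations use random or k-means++ initialization)
    centroids = points[:k]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step (the compute-intensive part): recompute each
        # centroid as the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids = sorted(kmeans(points, k=2))
# → centroids ≈ [1.0, 9.53]
```

In a distributed setting each worker computes partial sums over its partition of the points; only the small per-cluster (sum, count) pairs are exchanged per iteration.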
K-MEANS
[Figures: running times (s) across iterations, number of machines and input size; panels a)–f)]
LOG REG
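A minimal sketch of the classifier on a binary task (plain batch gradient descent on 1-D toy data; the data, names and hyperparameters are illustrative, not those used in the benchmark):

```python
import math

def train_logreg(xs, ys, lr=0.5, epochs=200):
    # Batch gradient descent on 1-D inputs with a bias term
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # sigmoid
            grad_w += (p - y) * x
            grad_b += (p - y)
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return w, b

def predict(w, b, x):
    # Classify by thresholding the predicted probability at 0.5
    return 1 if 1.0 / (1.0 + math.exp(-(w * x + b))) >= 0.5 else 0

xs = [0.5, 1.0, 1.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logreg(xs, ys)
# the trained model separates the two groups of points
```

Each epoch is a cheap linear pass over the data, which is why the per-iteration cost is dominated by I/O and deserialization rather than computation.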
CONCLUSIONS
• MAD design can help the analysis process, much as the AGILE methodology helps the software development process.
• The better performance of parallel DBMSs should be seen as complementary to MapReduce systems.
• MapReduce provides powerful abstractions for data processing, analytics and machine learning to the end user, which naturally evolve into the new ”transform and act” paradigm used in Spark.
• Spark takes the best techniques from both ABDS and HPC. It is the core of BDAS and the best framework in this scenario.
• Resilient distributed datasets (RDDs) are an efficient, general-purpose and fault-tolerant abstraction for sharing data in cluster applications, and they are the added value of Spark.
• Frameworks like Twister and HaLoop are good candidates as alternatives to Spark, but they do not appear to be mature enough.
Acknowledgements
Reviewers: Dr. Sasu Tarkoma, Dr. Mohammad Hoque