setting up a mini big data architecture, just for you! - bas geerdink

Building a (mini) Big Data architecture

Bas Geerdink

5 November 2014

About me

• Work: ING

• Education: Master’s degree in AI and Informatics

• Programming since 1998 (C#, Java, Scala, Python, …)

• Twitter: @bgeerdink

• Email: bas.geerdink@ing.nl

Introduction

• Big Data

– Volume, Velocity, Variety

• Predictive Analytics / Machine Learning

– Classification

– Clustering

– Recommendation

• Today’s goal:

– Start small, create a playground!

– Learn some basic tools and techniques

Reference big data solution architecture

There are several out-of-the-box options to get started with big data development

• On-premise:

– Hortonworks

– Cloudera

– MapR

– IBM InfoSphere BigInsights

– HP Vertica

– Oracle

– Teradata

– SAS

• Cloud-based:

– Amazon Elastic MapReduce

– Microsoft Azure HDInsight

– Google (App Engine, BigTable, Prediction API, …)

– SAP HANA

… however, we’ll set up our own environment!

Mahout features

• Optimized for large datasets (millions of records)

• Moving from Hadoop to Spark

• Supervised learning

– Classification: Naïve Bayes, Hidden Markov Models (NN), Random Forest

– Logistic Regression (predicts a probability between 0 and 1, used for classification)

• Unsupervised learning

– Clustering: k-Means, Canopy

– Recommendations
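To make the clustering bullet concrete, here is a minimal k-means sketch in plain Python. It is not Mahout's implementation, only an illustration of the idea on a handful of made-up 2-D points; the function name `kmeans` and the `toy_points` data are assumptions introduced for this example.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points into k groups (illustrative sketch, not Mahout code)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean_x = sum(x for x, _ in cluster) / len(cluster)
                mean_y = sum(y for _, y in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable, so stop early
        centroids = new_centroids
    return centroids, clusters

# Made-up points forming two well-separated groups.
toy_points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = kmeans(toy_points, k=2)
print(centroids)
print(clusters)
```

On millions of records this single-machine loop stops being practical, which is where a distributed implementation on Hadoop or Spark comes in.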

Mahout Algorithms

Size of dataset   Mahout algorithm             Execution model   Characteristics

Small             SGD                          Sequential        Uses all types of predictor vars

Medium            (Complementary) Naïve Bayes  Parallel          Prefers text, high training cost

Large             Random Forest                Parallel          Uses all types of predictor vars, high training cost

Source: Cloudera (2011)

Example 1: newsgroups

• Data: newsgroup items

• 20,000 records

• Train with Naïve Bayes Classifier

• Categories: 20 newsgroups

• Prediction: newsgroup of unclassified item
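The idea behind this example can be sketched in a few lines of plain Python: a multinomial Naïve Bayes classifier with add-one smoothing. This is not the Mahout job used in the talk; the four training sentences are invented for illustration, and the helper names `train_naive_bayes` and `classify` are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """Count documents per class and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in labelled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(model, text):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Log prior: share of training documents in this class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training items; the real example uses about 20,000 newsgroup posts.
training = [
    ("rec.sport.hockey", "the team won the game in overtime"),
    ("rec.sport.hockey", "goalie saves puck in final period"),
    ("sci.space", "the shuttle reached orbit after launch"),
    ("sci.space", "nasa plans a new moon mission"),
]
model = train_naive_bayes(training)
print(classify(model, "the goalie won the game"))
print(classify(model, "shuttle launch to orbit"))
```

With 20 categories instead of two, the same code applies unchanged: each newsgroup simply becomes one more label in the training list.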

Example 2: hospital treatment

• Data: hospital surgeries in the 1950s, 1960s and 1970s

• 306 records

• Train with logistic regression

• Features:

– Age of subject

– Year of treatment

– Number of positive axillary nodes

• Prediction: survival rate

• Visualization: D3.js
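As a rough illustration of what training with logistic regression means here, the sketch below fits a model with stochastic gradient descent (SGD) on the three features listed above. The ten records are invented and are not taken from the 306-record dataset; the feature scaling, the learning rate, and the function names are likewise assumptions made for this example rather than the settings used in the talk.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def scale(age, year, nodes):
    """Bring the three features into a similar numeric range (assumed scaling)."""
    return [age / 100.0, year / 100.0, nodes / 10.0]

def train_logistic_sgd(rows, labels, epochs=500, learning_rate=0.5, seed=0):
    """Fit weights (bias first) by stochastic gradient descent on the log loss."""
    rng = random.Random(seed)
    weights = [0.0] * (len(rows[0]) + 1)
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x = [1.0] + rows[i]  # prepend the constant bias input
            prediction = sigmoid(sum(w * v for w, v in zip(weights, x)))
            error = labels[i] - prediction
            weights = [w + learning_rate * error * v for w, v in zip(weights, x)]
    return weights

def predict_probability(weights, row):
    return sigmoid(sum(w * v for w, v in zip(weights, [1.0] + row)))

# Invented records: (age, year of treatment, positive axillary nodes, survived).
records = [
    (35, 64, 0, 1), (42, 60, 1, 1), (50, 66, 2, 1), (61, 59, 0, 1), (47, 63, 3, 1),
    (55, 65, 15, 0), (66, 58, 20, 0), (43, 62, 12, 0), (58, 61, 25, 0), (70, 67, 9, 0),
]
rows = [scale(age, year, nodes) for age, year, nodes, _ in records]
labels = [survived for _, _, _, survived in records]
weights = train_logistic_sgd(rows, labels)

p_few_nodes = predict_probability(weights, scale(50, 63, 0))
p_many_nodes = predict_probability(weights, scale(50, 63, 20))
print(p_few_nodes, p_many_nodes)
```

The model outputs a probability of survival for a single subject; a value like this could then be fed to a D3.js chart for visualization.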

Summary

Want to move on?

• Follow courses on Coursera

– Machine Learning: https://www.coursera.org/course/ml

– Introduction to Data Science: https://www.coursera.org/course/datasci

• Read Hadoop/Mahout/R tutorials and books

• Get some ML datasets:

– http://archive.ics.uci.edu/ml/datasets.html

– http://aws.amazon.com/datasets

– http://www.datasciencecentral.com/profiles/blogs/big-data-sets-available-for-free

• Expand the ecosystem: Hive, Pig, HBase, Spark, …
