Setting up a mini big data architecture, just for you! - Bas Geerdink
Post on 02-Jul-2015
Building a (mini) Big Data architecture
Bas Geerdink
5 November 2014
About me
• Work: ING
• Education: Master’s degree in AI and Informatics
• Programming since 1998 (C#, Java, Scala, Python, …)
• Twitter: @bgeerdink
• Email: bas.geerdink@ing.nl
Introduction
• Big Data
– Volume, Velocity, Variety
• Predictive Analytics / Machine Learning
– Classification
– Clustering
– Recommendation
• Today’s goal:
– Start small, create a playground!
– Learn some basic tools and techniques
Reference big data solution architecture
There are several out-of-the-box options to get started with big data development
• On-premise:
– Hortonworks
– Cloudera
– MapR
– IBM InfoSphere BigInsights
– HP Vertica
– Oracle
– Teradata
– SAS
• Cloud-based:
– Amazon Elastic MapReduce
– Microsoft Azure HDInsight
– Google (App Engine, BigTable, Prediction API, …)
– SAP HANA
… however, we’ll set up our own environment!
Mahout features
• Optimized for large datasets (millions of records)
• Moving from Hadoop to Spark
• Supervised learning
– Classification: Naïve Bayes, Hidden Markov Models (NN), Random Forest
– Logistic Regression (predicts the probability of an outcome, a value between 0 and 1)
• Unsupervised learning
– Clustering: k-Means, Canopy
– Recommendations
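To make the clustering bullet concrete, here is a minimal k-Means sketch in plain Python. It is not Mahout code: the function names, the naive "first k points" initialisation, and the six toy points are all assumptions chosen for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(column) / len(column) for column in zip(*points))


def kmeans(points, k, iterations=10):
    """Cluster points into k groups; returns (centroids, clusters)."""
    # Assumption: take the first k points as starting centroids (naive but deterministic)
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            mean(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up toy points forming two obvious groups
POINTS = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]

centroids, clusters = kmeans(POINTS, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

On a real dataset the starting centroids are usually picked at random or by a pre-clustering pass; Canopy, listed above, is commonly used for that purpose.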
Mahout Algorithms
Size of dataset | Mahout algorithm            | Execution model | Characteristics
Small           | SGD                         | Sequential      | Uses all types of predictor vars
Medium          | (Complementary) Naïve Bayes | Parallel        | Prefers text, high training cost
Large           | Random Forest               | Parallel        | Uses all types of predictor vars, high training cost
Source: Cloudera (2011)
Example 1: newsgroups
• Data: newsgroup items
• 20,000 records
• Train with Naïve Bayes Classifier
• Categories: 20 newsgroups
• Prediction: newsgroup of unclassified item
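The idea behind this example can be sketched without Mahout. Below is a small multinomial Naïve Bayes classifier with add-one smoothing in plain Python. The four training posts are made up, the two group labels are used only for illustration, and all function names are assumptions; the real example trains on the full newsgroup data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train(documents):
    """Count documents and words per label; documents is a list of (label, text)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in documents:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-posterior for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior: share of training documents carrying this label
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Log likelihood with Laplace (add-one) smoothing
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy posts, not the real newsgroup data
TRAINING = [
    ("rec.autos", "the engine and brakes of my car need repair"),
    ("rec.autos", "new car with a bigger engine"),
    ("sci.space", "the rocket reached orbit around the moon"),
    ("sci.space", "nasa launched a new rocket into orbit"),
]

model = train(TRAINING)
print(predict(model, "which engine fits this car"))
print(predict(model, "a rocket to the moon"))
```

Working with log-probabilities avoids numeric underflow when an item contains many words, which matters once the vocabulary grows beyond a toy example.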
Example 2: hospital treatment
• Data: hospital surgeries in the 50s, 60s and 70s
• 306 records
• Train with logistic regression
• Features:
– Age of subject
– Year of treatment
– Number of positive axillary nodes
• Prediction: survival rate
• Visualization: D3.js
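As a rough illustration of the training step, the sketch below fits a logistic regression on the three features listed above using plain Python. Several things are assumptions: the ten records are made up (the real dataset has 306), the model is fitted with batch gradient descent rather than Mahout's SGD implementation, and the function names, learning rate, and epoch count are arbitrary choices.

```python
import math

# Made-up toy records: (age, year of treatment, positive axillary nodes, survived)
RECORDS = [
    (30, 64, 1, 1), (35, 60, 0, 1), (42, 62, 2, 1), (50, 65, 0, 1), (55, 59, 3, 1),
    (45, 61, 20, 0), (52, 63, 15, 0), (60, 60, 25, 0), (65, 66, 10, 0), (38, 58, 30, 0),
]


def standardize(rows):
    """Scale each column to zero mean and unit variance; returns means, stds, rows."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(columns, means)]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return means, stds, scaled


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(features, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += error
            for i, v in enumerate(x):
                grad_w[i] += error * v
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def survival_probability(model, patient):
    """Predicted probability of survival for (age, year, nodes)."""
    means, stds, weights, bias = model
    scaled = [(v - m) / s for v, m, s in zip(patient, means, stds)]
    return sigmoid(bias + sum(w * v for w, v in zip(weights, scaled)))


features = [record[:3] for record in RECORDS]
labels = [record[3] for record in RECORDS]
means, stds, scaled = standardize(features)
weights, bias = train(scaled, labels)
model = (means, stds, weights, bias)

print(survival_probability(model, (50, 62, 0)))   # few positive nodes
print(survival_probability(model, (50, 62, 30)))  # many positive nodes
```

The output is a probability between 0 and 1 per patient, which is the kind of value a D3.js chart could plot against age or node count.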
Summary
Want to move on?
• Follow courses on Coursera
– Machine Learning: https://www.coursera.org/course/ml
– Introduction to Data Science: https://www.coursera.org/course/datasci
• Read Hadoop/Mahout/R tutorials and books
• Get some ML datasets:
– http://archive.ics.uci.edu/ml/datasets.html
– http://aws.amazon.com/datasets
– http://www.datasciencecentral.com/profiles/blogs/big-data-sets-available-for-free
• Expand the ecosystem: Hive, Pig, HBase, Spark, …