dataiku - hadoop ecosystem - @epitech paris - janvier 2014
DESCRIPTION
Overview of Hadoop and its ecosystem: Pig, Hive, GraphLab, Mahout, Impala, Spark, Storm, ... Presented at Epitech Paris in January 2014.
TRANSCRIPT
![Page 1: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/1.jpg)
How Do Elephants Make Babies
Florian Douetteau, CEO, Dataiku
![Page 2: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/2.jpg)
Agenda
• Part #1 Big Data
• Part #2 Why Hadoop, How, and When
• Part #3 Overview of the Coding Ecosystem: Pig / Hive / Cascading
• Part #4 Overview of the Machine Learning Ecosystem: Mahout
• Part #5 Overview of the Extended Ecosystem
![Page 3: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/3.jpg)
Dataiku 1/8/14
PART #1: BIG DATA
![Page 4: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/4.jpg)
Collocation
Big Apple, Big Mama, Big Data
Collocation: A familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association.
![Page 5: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/5.jpg)
“Big” Data in 1999
struct Element { Key key; void* stat_data; } ….
• C optimized data structures
• Perfect hashing
• HP-UNIX servers, 4 GB RAM
• 100 GB data
• Web crawler, socket reuse, HTTP 0.9
1 Month
![Page 6: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/6.jpg)
Big Data in 2013
• Hadoop
• Java / Pig / Hive / Scala / Clojure / …
• A dozen NoSQL data stores
• MPP databases
• Real-time
1 Hour
![Page 7: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/7.jpg)
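To Hadoop
[Figure: use cases plotted by year, data volume and dollar figure. Use cases: Web Search (1999), Logistics (2004), Banking CRM (2008), Web Search (2010), Social Gaming (2011), Online Advertising (2012), E-Commerce (2013). Data volume / dollar figures shown: 1 TB / ? $, 1 TB / 100M $, 1 TB / 1B $, 100 TB / ? $, 10 TB / 10M $, 1000 TB / 500M $, 50 TB / 1B $. Axis labels: SQL or ad hoc; SQL + Hadoop.]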
![Page 8: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/8.jpg)
Dataiku - Data Tuesday
Meet Hal Alowne
Hal Alowne, BI Manager, Dim’s Private Showroom
European e-commerce web site
• 100M$ revenue
• 1 million customers
• 1 data analyst (Hal himself)
Dim Sum, CEO & Founder, Dim’s Private Showroom
“Hey Hal! We need a big data platform like the big guys. Let’s just do as they do!”
Big Guys
• 10B$+ revenue
• 100M+ customers
• 100+ data scientists
Big Data Copy Cat Project
![Page 9: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/9.jpg)
QUESTION #1: IS IT EASY OR NOT?
![Page 10: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/10.jpg)
SUBTLE PATTERNS
![Page 11: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/11.jpg)
"MORE BUSINESS"
BUTTONS
![Page 12: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/12.jpg)
QUESTION #2: WHO TO HIRE?
![Page 13: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/13.jpg)
DATA SCIENTIST AT NIGHT
![Page 14: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/14.jpg)
DATA CLEANER DURING THE DAY
![Page 15: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/15.jpg)
PARADOX #3
WHERE ?
![Page 16: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/16.jpg)
MY DATA IS WORTH MILLIONS
![Page 17: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/17.jpg)
I SEND IT TO THE MARKETING CLOUD
![Page 18: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/18.jpg)
QUESTION #4: IS IT BIG OR NOT?
![Page 19: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/19.jpg)
WE ALL LIVE IN A BIG DATA
LAKE
![Page 20: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/20.jpg)
ALL MY DATA PROBABLY FITS IN HERE
![Page 21: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/21.jpg)
QUESTION #5 (at last)
HUMAN OR NOT ?
![Page 22: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/22.jpg)
MACHINE LEARNING WILL SAVE US ALL
![Page 23: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/23.jpg)
I JUST WANT MORE
REPORTS
![Page 24: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/24.jpg)
MERIT = TIME + ROI
TIME: 6 MONTHS
Build a lab in 6 months (rather than 18 months)
• Find the right people (6 months?)
• Choose the technology (6 months?)
• Make it work (6 months?)
Build the lab (6 months)
• Train people
• Reuse working patterns
ROI: APPS
Deploy apps that actually deliver value
• Targeted newsletter
• Recommender systems
• Adapted product / promotions
Timeline: 2013, 2014
![Page 25: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/25.jpg)
Statistics and Machine Learning are complex!
Try to understand myself
![Page 26: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/26.jpg)
(Some books you might want to read)
![Page 27: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/27.jpg)
Dataiku - Pig, Hive and Cascading
CHOOSE TECHNOLOGY
[Figure: map of the technology landscape.]
Regions: Machine Learning Mystery Land, Scalability Central, NoSQL-Slavia, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House, Real-time Island
Tools: Hadoop, Ceph, Sphere, Cassandra, Kafka, Flume, Spark, Storm, Scikit-Learn, GraphLab, prediction.io, jubatus, Mahout, WEKA, MLBase, LibSVM, SAS, RapidMiner, SPSS, Pandas, R, QlikView, Tableau, Kibana, SpotFire, D3, InfiniDB, Drill, Vertica, GreenPlum, Impala, Netezza, ElasticSearch, SOLR, MongoDB, Riak, CouchBase, Pig, Cascading, Talend
![Page 28: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/28.jpg)
Big Data Use Case #1: Manage Volumes
• Business Intelligence stack has scalability and maintenance issues
• Backoffice implements business rules that are challenged
• Existing infrastructure cannot cope with per-user information
Main pain point: 23 hours 52 minutes to compute Business Intelligence aggregates for one day.
![Page 29: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/29.jpg)
Big Data Use Case #1: Manage Volumes
• Relieve their current DWH and accelerate production of some aggregates/KPIs
• Be the backbone for a new personalized user experience on their website: more recommendations, more profiling, etc.
• Train existing people around machine learning and segmentation experience
• 1h12 to perform the aggregate, available every morning
• New home page personalization deployed in a few weeks
• Hadoop cluster (24 cores)
• Google Compute Engine
• Python + R + Vertica
• 12 TB dataset
• 6-week project
![Page 30: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/30.jpg)
Big Data Use Case #2: Find Patterns
• A very large community
• Some mid-size communities
• Lots of small clusters (mostly 2 players)
Correlation
◦ between community size and engagement / virality
Meaningful patterns
◦ 2 players / Family / Group
What is the minimum number of friends to have in the application to get additional engagement?
![Page 31: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/31.jpg)
How do I (pre)process data?
Implicit User Data (Views, Searches, …)
Content Data (Title, Categories, Price, …)
Explicit User Data (Click, Buy, …)
User Information (Location, Graph, …)
500TB
50TB
1TB
200GB
Transformation Matrix
Transformation Predictor
Per User Stats
Per Content Stats
User Similarity
Rank Predictor
Content Similarity
A/B Test Data
Predictor Runtime
Online User Information
![Page 32: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/32.jpg)
Always the same
Pour Data In
Compute Something
Smart About It
Make Available
![Page 33: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/33.jpg)
The Questions
Pour Data In
Compute Something
Smart About It
Make Available
How often ? What kind of interaction? How much ?
How complex ? Do you need all data at once ? How incremental ?
Interaction ? Random Access ?
![Page 34: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/34.jpg)
PART #2: AT THE BEGINNING WAS THE ELEPHANT
![Page 35: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/35.jpg)
The Text Use Case
Pour Data In
Compute Something
Smart About It
Make Available
Large volume (1 TB) of text-like data (logs, docs, …)
Massive global transformation, then aggregation (counting, inverted index, …)
Every Day
![Page 36: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/36.jpg)
What’s Difficult (back in 2000)
• Large Data won’t fit in one server
• Large computations (a few hours) are bound to fail at one time or another
• Data is so big that my memory is too small to perform full aggregations
• Parallelization with threading is error-prone
• Data is so big that my Ethernet cable is not that big
![Page 37: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/37.jpg)
What’s Difficult (back in 2000)
• Large Data won’t fit in one server
• Large computations (a few hours) are bound to fail at one time or another
• Data is so big that my memory is too small to perform full aggregations
• Parallelization with threading is error-prone
• Data is so big that my Ethernet cable is not that big
HDFS
JOB TRACKER
MAP REDUCE
![Page 38: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/38.jpg)
![Page 39: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/39.jpg)
![Page 40: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/40.jpg)
MapReduce: How to count words in many, many boxes
![Page 41: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/41.jpg)
MapReduce PREREQUISITES
• GROUPS CAN BE DETERMINED AT THE ROW LEVEL
• AGGREGATION OPERATION IS IDEMPOTENT
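The word-count idea can be sketched in a few lines of plain Python. This is only an in-memory illustration, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample "boxes" are hypothetical. The point is that the group key (the word) is computed from each row alone, and that the aggregation (a sum) can be applied to the values of a key regardless of which box they came from.

```python
from collections import defaultdict


def map_phase(line):
    """Emit one (word, 1) pair per word; the key depends on the row alone."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Aggregate all the values collected for one key."""
    return key, sum(values)


def word_count(boxes):
    """boxes: a list of splits, each one a list of text lines."""
    pairs = []
    for box in boxes:  # each box could be handled by a different machine
        for line in box:
            pairs.extend(map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count([["big data big apple"], ["big mama", "data lake"]]))
```

In a real cluster the shuffle moves data over the network, which is why the amount of data emitted by the map phase matters.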
![Page 42: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/42.jpg)
Questions ?
![Page 43: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/43.jpg)
PART #3: CODING HADOOP
![Page 44: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/44.jpg)
Pig History
• Yahoo Research in 2006
• Inspired from Sawzall, a Google paper from 2003
• 2007 as an Apache project
Initial motivation
◦ Search log analytics: how long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary in the course of a day/week/month? …

words = LOAD '/training/hadoop-wordcount/output' USING PigStorage('\t')
    AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;
![Page 45: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/45.jpg)
Hive History
• Developed by Facebook in January 2007
• Open sourced in August 2008
Initial motivation
◦ Provide a SQL-like abstraction to perform statistics on status updates

create external table wordcounts (word string, count int)
    row format delimited fields terminated by '\t'
    location '/training/hadoop-wordcount/output';
select * from wordcounts order by count desc limit 10;
select SUM(count) from wordcounts where word like 'th%';
![Page 46: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/46.jpg)
Cascading History
• Authored by Chris Wensel in 2008
Associated projects
◦ Cascalog: Cascading in Clojure
◦ Scalding: Cascading in Scala (Twitter in 2012)
◦ Lingual (to be released soon): SQL layer on top of Cascading
![Page 47: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/47.jpg)
Dataiku - Innovation Services
Pig & Hive: Mapping to MapReduce jobs
events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;
by_user = GROUP events_filtered BY user;
price_by_user = FOREACH by_user GENERATE group AS user, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000;
Job 1: Mapper → shuffle and sort by user → Job 1: Reducer
LOAD → FILTER → GROUP → FOREACH → FILTER
![Page 48: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/48.jpg)
Pig & Hive: Mapping to MapReduce jobs
events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;
by_user = GROUP events_filtered BY user;
price_by_user = FOREACH by_user GENERATE group AS user, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000;
recent_high = ORDER high_pbu BY max_ts DESC;
STORE recent_high INTO '/output';
Job 1: Mapper → shuffle and sort by user → Job 1: Reducer
LOAD → FILTER → GROUP → FOREACH → FILTER
Job 2: Mapper → shuffle and sort by max_ts → Job 2: Reducer
LOAD (from tmp) → STORE
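To make the two-job split concrete, here is a rough Python simulation of the script above. It is a sketch under stated assumptions, not what Pig generates: the sample `EVENTS`, the function names, and in particular the `"buy"` filter condition are hypothetical (the slide does not give the actual condition on `type`). Job 1 groups by user, so it needs one shuffle; the final ORDER needs a different key (`max_ts`), hence a second job reading the temporary output of the first.

```python
from collections import defaultdict

# Hypothetical event rows: (type, user, price, timestamp).
EVENTS = [
    ("buy", "alice", 800, 10),
    ("buy", "alice", 400, 30),
    ("view", "alice", 0, 40),
    ("buy", "bob", 1500, 60),
    ("buy", "carol", 100, 50),
]


def job1_map(event):
    """LOAD + FILTER, then emit the grouping key (the 'buy' test is assumed)."""
    etype, user, price, ts = event
    if etype == "buy":
        yield user, (price, ts)


def job1_reduce(user, values):
    """FOREACH (SUM, MAX) followed by the FILTER on the aggregated total."""
    total_price = sum(price for price, _ in values)
    max_ts = max(ts for _, ts in values)
    if total_price > 1000:
        yield user, total_price, max_ts


def run_job1(events):
    groups = defaultdict(list)  # stands in for "shuffle and sort by user"
    for event in events:
        for user, value in job1_map(event):
            groups[user].append(value)
    rows = []
    for user in sorted(groups):
        rows.extend(job1_reduce(user, groups[user]))
    return rows  # in Hadoop this would be written to a temporary location


def run_job2(rows):
    """ORDER BY max_ts DESC: a second shuffle, keyed on max_ts this time."""
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    print(run_job2(run_job1(EVENTS)))
```

Everything that depends on a single row, or on a single group once it is formed, fits in one job; a change of key forces a new shuffle and therefore a new job.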
![Page 49: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/49.jpg)
Pig: How does it work?
Data Execution Plan compiled into 10 map reduce jobs executed in parallel (or not)
![Page 50: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/50.jpg)
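Hive Joins: How to join with MapReduce?

Mappers output:

| tbl_idx | uid | name |
|---|---|---|
| 1 | 1 | Dupont |
| 1 | 2 | Durand |

| tbl_idx | uid | type |
|---|---|---|
| 2 | 1 | Type1 |
| 2 | 1 | Type2 |
| 2 | 2 | Type1 |

Shuffle by uid, sort by (uid, tbl_idx)

Reducer 1 input:

| uid | tbl_idx | name | type |
|---|---|---|---|
| 1 | 1 | Dupont | |
| 1 | 2 | | Type1 |
| 1 | 2 | | Type2 |

Reducer 2 input:

| uid | tbl_idx | name | type |
|---|---|---|---|
| 2 | 1 | Durand | |
| 2 | 2 | | Type1 |

Reducer 1 output:

| uid | name | type |
|---|---|---|
| 1 | Dupont | Type1 |
| 1 | Dupont | Type2 |

Reducer 2 output:

| uid | name | type |
|---|---|---|
| 2 | Durand | Type1 |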
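The join above can be sketched in Python as follows. This is an in-memory illustration with hypothetical function names, not Hive's implementation: the map side tags each row with the index of its source table, a sort on (uid, tbl_idx) stands in for the shuffle, and the reduce side relies on that order to see the name row of a uid before its type rows.

```python
from itertools import groupby

USERS = [(1, "Dupont"), (2, "Durand")]               # (uid, name)
EVENTS = [(1, "Type1"), (1, "Type2"), (2, "Type1")]  # (uid, type)


def map_side(users, events):
    """Tag every row with the index of the table it comes from."""
    for uid, name in users:
        yield uid, 1, name
    for uid, etype in events:
        yield uid, 2, etype


def reduce_side(tagged_rows):
    """Join rows of the same uid; table 1 rows arrive before table 2 rows."""
    joined = []
    # Sorting by (uid, tbl_idx) plays the role of "shuffle by uid, sort by (uid, tbl_idx)".
    ordered = sorted(tagged_rows, key=lambda row: (row[0], row[1]))
    for uid, rows in groupby(ordered, key=lambda row: row[0]):
        names = []
        for _, tbl_idx, value in rows:
            if tbl_idx == 1:
                names.append(value)  # buffer the (small) left side
            else:
                joined.extend((uid, name, value) for name in names)
    return joined


if __name__ == "__main__":
    for row in reduce_side(map_side(USERS, EVENTS)):
        print(row)
```

Because of the secondary sort on tbl_idx, only the rows of the first table need to be buffered per key; the rows of the second table can be streamed.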
![Page 51: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/51.jpg)
WHAT IS THE BEST TOOL ?
![Page 52: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/52.jpg)
Comparing without Comparable
Philosophy
◦ Procedural vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration
Performance and optimization
![Page 53: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/53.jpg)
Procedural vs Declarative

Transformation as a sequence of operations (Pig):
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

Transformation as a set of formulas (SQL):
insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using ipaddr
group by dma;
![Page 54: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/54.jpg)
Data Type and Model: Rationale
All three extend the basic data model with extended data types
◦ array-like [event1, event2, event3]
◦ map-like {type1:value1, type2:value2, …}
Different approaches
◦ Resilient Schema
◦ Static Typing
◦ No Static Typing
![Page 55: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/55.jpg)
Dataiku Training – Hadoop for Data Science
Hive: Data Type and Schema

| Simple type | Details |
|---|---|
| TINYINT, SMALLINT, INT, BIGINT | 1, 2, 4 and 8 bytes |
| FLOAT, DOUBLE | 4 and 8 bytes |
| BOOLEAN | |
| STRING | Arbitrary-length, replaces VARCHAR |
| TIMESTAMP | |

| Complex type | Details |
|---|---|
| ARRAY | Array of typed items (0-indexed) |
| MAP | Associative map |
| STRUCT | Complex class-like objects |

CREATE TABLE visit (
    user_name STRING,
    user_id INT,
    user_details STRUCT<age:INT, zipcode:INT>
);
![Page 56: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/56.jpg)
Pig: Data Types and Schema

rel = LOAD '/folder/path/' USING PigStorage('\t') AS (col:type, col:type, col:type);

| Simple type | Details |
|---|---|
| int, long, float, double | 32 and 64 bits, signed |
| chararray | A string |
| bytearray | An array of … bytes |
| boolean | A boolean |

| Complex type | Details |
|---|---|
| tuple | a tuple is an ordered fieldname:value map |
| bag | a bag is a set of tuples |
![Page 57: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/57.jpg)
Cascading: Data Type and Schema
• Support for any Java types, provided they can be serialized in Hadoop
• No support for typing

| Simple type | Details |
|---|---|
| Int, Long, Float, Double | 32 and 64 bits, signed |
| String | A string |
| byte[] | An array of … bytes |
| Boolean | A boolean |

| Complex type | Details |
|---|---|
| Object | Object must be « Hadoop serializable » |
![Page 58: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/58.jpg)
Style Summary

| | Style | Typing | Data Model | Metadata store |
|---|---|---|---|---|
| Pig | Procedural | Static + Dynamic | scalar + tuple + bag (fully recursive) | No (HCatalog) |
| Hive | Declarative | Static + Dynamic, enforced at execution time | scalar + list + map | Integrated |
| Cascading | Procedural | Weak | scalar + java objects | No |
![Page 59: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/59.jpg)
Comparing without Comparable
Philosophy
◦ Procedural vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing, error management and environment
Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration
Performance and optimization
![Page 60: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/60.jpg)
Headachability: Motivation
Does debugging the tool lead to bad headaches?
![Page 61: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/61.jpg)
Headaches: Pig
• Out Of Memory Error (Reducer)
• Exception in Building / Extended Functions (handling of null)
• Null vs “”
• Nested Foreach and scoping
• Date Management (pig 0.10)
• Field implicit ordering
![Page 62: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/62.jpg)
A Pig Error
![Page 63: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/63.jpg)
Headaches: Hive
• Out of Memory Errors in Reducers
• Few Debugging Options
• Null / “”
• No builtin “first”
![Page 64: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/64.jpg)
Headaches: Cascading
• Weak Typing Errors (comparing Int and String, …)
• Illegal Operation Sequence (Group after group, …)
• Field Implicit Ordering
![Page 65: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/65.jpg)
Testing: Motivation
• How to perform unit tests?
• How to have different versions of the same script (parameter)?
![Page 66: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/66.jpg)
Testing: Pig
• System variables
• Comment to test
• No meta programming
• pig -x local to execute on local files
![Page 67: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/67.jpg)
Testing / Environment: Cascading
• JUnit tests are possible
• Ability to use code to actually comment out some variables
![Page 68: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/68.jpg)
Checkpointing: Motivation
• Lots of iteration while developing on Hadoop
• Sometimes jobs fail
• Sometimes you need to restart from the start …
[Figure: pipeline with stages Parse Logs, Per Page Stats, Page User Correlation, Filtering, Output. A stage FAILs; FIX and relaunch.]
![Page 69: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/69.jpg)
Pig: Manual Checkpointing
• STORE command to manually store files
• // COMMENT beginning of script and relaunch
[Figure: pipeline with stages Parse Logs, Per Page Stats, Page User Correlation, Filtering, Output.]
![Page 70: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/70.jpg)
Cascading: Automated Checkpointing
• Ability to re-run a flow automatically from the last saved checkpoint
• addCheckpoint(…)
![Page 71: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/71.jpg)
Cascading: Topological Scheduler
• Check each intermediate file timestamp
• Execute only if more recent
[Figure: pipeline with stages Parse Logs, Per Page Stats, Page User Correlation, Filtering, Output.]
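The "execute only if more recent" rule is the same one `make` applies to files. Here is a minimal Python sketch of that rule, not the Cascading API: `is_stale`, `run_flow` and the `(output, inputs, action)` step layout are hypothetical names chosen for the illustration, and the steps are assumed to be listed in topological order.

```python
import os


def is_stale(output_path, input_paths):
    """True when the output is missing or older than at least one input."""
    if not os.path.exists(output_path):
        return True
    out_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(path) > out_mtime for path in input_paths)


def run_flow(steps):
    """Run only the stale steps.

    steps: list of (output_path, input_paths, action) in topological order,
    where action(output_path, input_paths) produces the output file.
    Returns the list of outputs that were actually rebuilt.
    """
    rebuilt = []
    for output_path, input_paths, action in steps:
        if is_stale(output_path, input_paths):
            action(output_path, input_paths)
            rebuilt.append(output_path)
    return rebuilt
```

A step that is rebuilt gets a newer timestamp, so every step downstream of it becomes stale in turn, while unrelated branches of the pipeline are skipped.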
![Page 72: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/72.jpg)
Productivity Summary

| | Headaches | Checkpointing / Replay | Testing / Metaprogramming |
|---|---|---|---|
| Pig | Lots | Manual save | Difficult meta programming, easy local testing |
| Hive | Few, but without debugging options | None (that’s SQL) | None (that’s SQL) |
| Cascading | Weak typing, complexity | Checkpointing, partial updates | Possible |
![Page 73: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/73.jpg)
Comparing without Comparable
Philosophy
◦ Procedural vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration
Performance and optimization
![Page 74: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/74.jpg)
Formats Integration: Motivation
Ability to integrate different file formats
◦ Text Delimited
◦ Sequence File (binary Hadoop format)
◦ Avro, Thrift, …
Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)

Format impact on size and performance:

| Format | Size on disk (GB) | Hive processing time (24 cores) |
|---|---|---|
| Text file, uncompressed | 18.7 | 1m32s |
| 1 text file, gzipped | 3.89 | 6m23s (no parallelization) |
| JSON compressed | 7.89 | 2m42s |
| Multiple text files, gzipped | 4.02 | 43s |
| Sequence file, block, gzip | 5.32 | 1m18s |
| Text file, LZO indexed | 7.03 | 1m22s |
![Page 75: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/75.jpg)
Format Integration
• Hive: SerDe (Serializer-Deserializer)
• Pig: Storage
• Cascading: Tap
![Page 76: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/76.jpg)
Partitions: Motivation
No support for “UPDATE” patterns; any increment is performed by adding or deleting a partition
Common partition schemas on Hadoop
◦ By date: /apache_logs/dt=2013-01-23
◦ By data center: /apache_logs/dc=redbus01/…
◦ By country
◦ …
◦ Or any combination of the above
![Page 77: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/77.jpg)
Hive Partitioning: Partitioned tables

CREATE TABLE event (
    user_id INT,
    type STRING,
    message STRING
)
PARTITIONED BY (day STRING, server_id STRING);

Disk structure
/hive/event/day=2013-01-27/server_id=s1/file0
/hive/event/day=2013-01-27/server_id=s1/file1
/hive/event/day=2013-01-27/server_id=s2/file0
/hive/event/day=2013-01-27/server_id=s2/file1
…
/hive/event/day=2013-01-28/server_id=s2/file0
/hive/event/day=2013-01-28/server_id=s2/file1

INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
SELECT * FROM event_tmp;
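The `key=value` directory convention is what makes partition filtering cheap: a query restricted to one day only has to read the matching directories. The Python sketch below illustrates just that naming convention. It is not how Hive resolves partitions (Hive uses its metastore), and `partition_path` and `prune` are hypothetical helper names.

```python
def partition_path(table_root, partition_spec):
    """Build a partition directory, e.g. /hive/event/day=2013-01-27/server_id=s1."""
    parts = [f"{key}={value}" for key, value in partition_spec]
    return "/".join([table_root.rstrip("/")] + parts)


def prune(paths, **wanted):
    """Keep only the files whose partition values match every requested key."""
    selected = []
    for path in paths:
        # Collect the key=value segments of the path into a dict.
        values = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
        if all(values.get(key) == value for key, value in wanted.items()):
            selected.append(path)
    return selected


if __name__ == "__main__":
    root = partition_path("/hive/event", [("day", "2013-01-27"), ("server_id", "s1")])
    print(root)
    files = [
        "/hive/event/day=2013-01-27/server_id=s1/file0",
        "/hive/event/day=2013-01-27/server_id=s2/file0",
        "/hive/event/day=2013-01-28/server_id=s2/file0",
    ]
    print(prune(files, day="2013-01-27"))
```

Adding one day of data then amounts to writing a new directory, which is why increments are expressed as partition additions rather than updates.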
![Page 78: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/78.jpg)
Cascading: Partition
• No direct support for partitions
• Support for “Glob” Tap, to read from files using patterns
➔ You can code your own custom or virtual partition schemes
![Page 79: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/79.jpg)
External Code Integration: Simple UDF
Pig, Hive, Cascading
![Page 80: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/80.jpg)
Hive: Complex UDF (Aggregators)
![Page 81: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/81.jpg)
Cascading: Direct Code Evaluation
Uses Janino, a very cool project: http://docs.codehaus.org/display/JANINO
![Page 82: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/82.jpg)
Integration Summary

| | Partition / Incremental Updates | External Code | Format Integration |
|---|---|---|---|
| Pig | No direct support | Simple | Doable and rich community |
| Hive | Fully integrated, SQL-like | Very simple, but complex dev setup | Doable and existing community |
| Cascading | With coding | Complex UDFs but regular, and Java expression embeddable | Doable and growing community |
![Page 83: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/83.jpg)
Comparing without Comparable
Philosophy
◦ Procedural vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration
Performance and optimization
![Page 84: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/84.jpg)
Optimization
Several Common Map Reduce Optimization Patterns
◦ Combiners
◦ MapJoin
◦ Job Fusion
◦ Job Parallelism
◦ Reducer Parallelism
Different support per framework
◦ Fully Automatic
◦ Pragma / Directives / Options
◦ Coding style / Code to write
![Page 85: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/85.jpg)
SELECT date, COUNT(*) FROM product GROUP BY date
Combiner: Perform Partial Aggregate at Mapper Stage
Without a combiner, each Map task sends every row to Reduce:
Map 1 input and output: 2012-02-14 4354, …, 2012-02-15 21we2
Map 2 input and output: 2012-02-14 qa334, …, 2012-02-15 23aq2
Reduce output:
2012-02-14 20
2012-02-15 35
2012-02-16 1
![Page 86: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/86.jpg)
SELECT date, COUNT(*) FROM product GROUP BY date
Combiner: Perform Partial Aggregate at Mapper Stage
Map 1 input: 2012-02-14 4354, …, 2012-02-15 21we2
Map 2 input: 2012-02-14 qa334, …, 2012-02-15 23aq2
Map 1 output after combiner: 2012-02-14 12, 2012-02-15 23, 2012-02-16 1
Map 2 output after combiner: 2012-02-14 8, 2012-02-15 12
Reduce output: 2012-02-14 20, 2012-02-15 35, 2012-02-16 1
Reduced network bandwidth. Better parallelism.
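The partial aggregation can be sketched in plain Python, with no Hadoop involved. The function names and the generated product ids are hypothetical; the split sizes are chosen so that the partial counts match the figures on the slide (12 / 23 / 1 and 8 / 12).

```python
from collections import Counter


def map_with_combiner(rows):
    """Map + combine: emit one partial count per date found in this split."""
    partial = Counter()
    for date, _product_id in rows:
        partial[date] += 1
    return partial


def reduce_counts(partials):
    """Reduce: sum the partial counts coming from every mapper."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Two input splits (product ids are made up for the example).
split_1 = ([("2012-02-14", i) for i in range(12)]
           + [("2012-02-15", i) for i in range(23)]
           + [("2012-02-16", 0)])
split_2 = ([("2012-02-14", i) for i in range(8)]
           + [("2012-02-15", i) for i in range(12)])

partials = [map_with_combiner(split_1), map_with_combiner(split_2)]
totals = reduce_counts(partials)
```

Here 56 input rows shrink to 5 (date, count) pairs before the shuffle, which is where the bandwidth saving comes from.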
![Page 87: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/87.jpg)
Join Optimization: Map Join
Hive
set hive.auto.convert.join = true;
Pig
Cascading
(no aggregation support after HashJoin)
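The idea behind a map join is that the small table fits in memory, so each mapper loads it into a hash table and joins its share of the large table locally, with no shuffle and no reduce phase. A minimal Python sketch of that idea, assuming an inner join on the first field; `map_side_join` and the sample tables are hypothetical.

```python
def map_side_join(big_rows, small_table):
    """Inner-join big_rows with small_table on the first field of each row."""
    # Build phase: the small table is held in memory as a hash table.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[0], []).append(row[1:])
    # Probe phase: stream the large table, one lookup per row.
    for row in big_rows:
        for match in lookup.get(row[0], []):
            yield row + match


products = [("p1", "Book"), ("p2", "Pen")]              # small table
sales = [("p1", 3), ("p3", 1), ("p2", 5), ("p1", 2)]    # large table
joined = list(map_side_join(sales, products))
```

The approach stops working once the "small" side no longer fits in the memory of a single task.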
![Page 88: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/88.jpg)
Number of Reducers
Critical for performance
Estimated from the size of the input files
◦ Hive: divide the input size by hive.exec.reducers.bytes.per.reducer (default 1GB)
◦ Pig: divide the input size by pig.exec.reducers.bytes.per.reducer (default 1GB)
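The estimate amounts to a rounded-up division. A sketch in Python, under stated assumptions: 1GB is taken as 10^9 bytes, and the upper bound `max_reducers=999` is an illustrative cap rather than a value taken from the slides.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=10**9, max_reducers=999):
    """Estimate the reducer count from the total input size."""
    # Ceiling division with integers only, to avoid float rounding.
    wanted = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    # Always at least one reducer, never more than the cap.
    return max(1, min(max_reducers, wanted))


small_job = estimate_reducers(200 * 10**6)   # 200 MB of input
medium_job = estimate_reducers(5 * 10**9)    # 5 GB of input
huge_job = estimate_reducers(10**13)         # 10 TB of input, hits the cap
```

The estimate only sees the input size, so a job whose map phase filters out most rows, or explodes them, is the typical case for setting the number by hand.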
![Page 89: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/89.jpg)
Performance & Optimization Summary

| | Combiner Optimization | Join Optimization | Number of reducers optimization |
|---|---|---|---|
| Pig | Automatic | Option | Estimate or DIY |
| Cascading | DIY | HashJoin | DIY |
| Hive | Partial DIY | Automatic (Map Join) | Estimate or DIY |
![Page 90: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/90.jpg)
Questions ?
![Page 91: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/91.jpg)
PART #4
QUICK MAHOUT
![Page 92: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/92.jpg)
Clustering
[Scatter plot; axes: Revenue, Age]
![Page 93: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/93.jpg)
Clustering
[Scatter plot; axes: Revenue, Age; one cluster highlighted]
One Cluster
Centroid == Center of the cluster
![Page 94: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/94.jpg)
clustering applications
• Fraud: Detect Outliers
• CRM : Mine for customer segments
• Image Processing : Similar Images
• Search : Similar documents
• Search : Allocate Topics
![Page 95: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/95.jpg)
K-Means
Guess an initial placement for centroids
Assign each point to the closest center (MAP)
Reposition each center (REDUCE)
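One K-Means iteration, split into its map and reduce halves, can be sketched in plain Python. This is an in-memory illustration, not Mahout code: the function names and sample points are hypothetical, Euclidean distance is an assumption, and `math.dist` requires Python 3.8 or later.

```python
import math


def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans_iteration(points, centroids):
    # MAP: emit (cluster id, point) for every point.
    assigned = {}
    for point in points:
        assigned.setdefault(closest(point, centroids), []).append(point)
    # REDUCE: the new centroid of a cluster is the mean of its points.
    new_centroids = list(centroids)  # an empty cluster keeps its old centroid
    for cluster_id, members in assigned.items():
        new_centroids[cluster_id] = tuple(
            sum(dim) / len(members) for dim in zip(*members)
        )
    return new_centroids


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids = kmeans_iteration(points, [(0, 0), (10, 10)])
```

The full algorithm repeats this step until the centroids stop moving, and on Hadoop each repetition is a separate Map/Reduce job.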
![Page 96: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/96.jpg)
![Page 97: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/97.jpg)
![Page 98: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/98.jpg)
![Page 99: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/99.jpg)
![Page 100: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/100.jpg)
![Page 101: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/101.jpg)
![Page 102: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/102.jpg)
![Page 103: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/103.jpg)
![Page 104: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/104.jpg)
![Page 105: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/105.jpg)
clustering challenges
• Curse of Dimensionality
• Choice of distance / number of parameters
• Performance
• Choice of the number of clusters
![Page 106: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/106.jpg)
Mahout Clustering Challenges
• No integrated feature engineering stack: get ready to write data processing in Java
• Hadoop SequenceFile required as an input
• Each iteration is a Map/Reduce job that reads from and writes to disk: relatively slow compared to in-memory processing
![Page 107: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/107.jpg)
![Page 108: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/108.jpg)
Data Processing
Image, Voice, Log / DB ➔ Data Processing ➔ Vectorized Data
![Page 109: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/109.jpg)
Mahout K-Means on Text Workflow
Text Files
➔ mahout seqdirectory ➔ Mahout Sequence Files
➔ mahout seq2sparse ➔ Tfidf Vectors
➔ mahout kmeans ➔ Clusters
![Page 110: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/110.jpg)
Mahout K-Means on Database Extract Workflow
Database Dump (CSV)
➔ org.apache.mahout.clustering.conversion.InputDriver ➔ Mahout Vectors
➔ mahout kmeans ➔ Clusters
![Page 111: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/111.jpg)
Convert a CSV File to Mahout Vector
• Real code would also have to handle:
• Converting categorical variables to dimensions
• Variable rescaling
• Dropping IDs (name, forename …)
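The three preprocessing steps listed above can be sketched in plain Python. This is not Mahout code: the column names and sample rows are made up, min-max scaling is one possible rescaling among others, and the result is a plain list of floats rather than a Mahout vector.

```python
import csv
import io


def csv_to_vectors(text, id_columns, categorical_columns):
    """Turn CSV rows into numeric vectors suitable for clustering."""
    rows = list(csv.DictReader(io.StringIO(text)))
    # Dropping IDs: identifier columns never reach the vector.
    numeric = [c for c in rows[0]
               if c not in id_columns and c not in categorical_columns]
    bounds = {c: (min(float(r[c]) for r in rows),
                  max(float(r[c]) for r in rows)) for c in numeric}
    levels = {c: sorted({r[c] for r in rows}) for c in categorical_columns}
    vectors = []
    for r in rows:
        vec = []
        # Variable rescaling: min-max scaling to [0, 1].
        for c in numeric:
            lo, hi = bounds[c]
            vec.append((float(r[c]) - lo) / (hi - lo) if hi > lo else 0.0)
        # Categorical variables: one dimension per observed value.
        for c in categorical_columns:
            vec.extend(1.0 if r[c] == level else 0.0 for level in levels[c])
        vectors.append(vec)
    return vectors


sample = "name,age,revenue,country\nAlice,20,1000,FR\nBob,40,3000,US\nCarol,30,2000,FR\n"
vectors = csv_to_vectors(sample, id_columns=["name"], categorical_columns=["country"])
```

Without the rescaling step, `revenue` would dominate every distance computation simply because its values are larger than those of `age`.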
![Page 112: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/112.jpg)
Mahout Algorithms

| Algorithm | Parameters | Implicit Assumption | Output |
|---|---|---|---|
| K-Means | K (number of clusters), Convergence | Circles | Point -> ClusterId |
| Fuzzy K-Means | K (number of clusters), Convergence | Circles | Point -> ClusterId*, Probability |
| Expectation Maximization | K (number of clusters), Convergence | Gaussian distribution | Point -> ClusterId*, Probability |
| Mean-Shift Clustering | Distance boundaries, Convergence | Gradient-like distribution | Point -> ClusterId |
| Top Down Clustering | Two clustering algorithms | Hierarchy | Point -> Large ClusterId, Small ClusterId |
| Dirichlet Process | Model distribution | Points are a mixture of distributions | Point -> ClusterId, Probability |
| Spectral Clustering | - | - | Point -> ClusterId |
| MinHash Clustering | Number of hash / keys, Hash type | High dimension | Point -> Hash* |
![Page 113: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/113.jpg)
Comparing Clustering
[Figure panels: KMeans, Dirichlet, Fuzzy KMeans, MeanShift]
![Page 114: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/114.jpg)
Questions ?
![Page 115: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/115.jpg)
PART #5
ELEPHANT MAKE BABIES
![Page 116: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/116.jpg)
What if ?
Pour Data In
Compute Something Smart About It
Make Available
Data comes continuously ?
Aggregation patterns are not “hashable”
Human interaction requires results fast or incrementally available ?
![Page 117: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/117.jpg)
After Hadoop
Massive Batch Map Reduce over HDFS
Random Access
Faster in Memory Computation
In Memory MultiCore Machine Learning
Real-Time Distributed Computation
Faster SQL Analytics Queries
![Page 118: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/118.jpg)
HBase
• Started by Powerset (now in Bing) in 2007
• Provides a key-value store on top of Hadoop
![Page 119: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/119.jpg)
HBASE
![Page 120: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/120.jpg)
GRAPHLAB
• High-performance distributed computing framework, in C++
• Started in 2009 at Carnegie Mellon
• Main applications in machine learning tasks: Topic Modeling, Collaborative Filtering, Computer Vision
• Can read data in HDFS
![Page 121: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/121.jpg)
SPARK
• Developed in 2010 at UC Berkeley
• Provides a distributed memory abstraction for running sequences of map/filter/join operations efficiently
• Can read from and store to HDFS or files
![Page 122: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/122.jpg)
SPARK
![Page 123: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/123.jpg)
STORM
• Developed in 2011 by Nathan Marz at BackType (then Twitter)
• Provides a framework for distributed, real-time, fault-tolerant computation
• Not a message queuing system, a complex event processing system
![Page 124: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/124.jpg)
STORM
![Page 125: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/125.jpg)
STORM WITH HADOOP
![Page 126: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/126.jpg)
IMPALA
• Started by Cloudera in 2012
• Provides real-time answers to SQL queries on top of HDFS
![Page 127: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/127.jpg)
BENCHMARK
![Page 128: Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014](https://reader038.vdocument.in/reader038/viewer/2022102716/548d88abb47959fd1f8b4869/html5/thumbnails/128.jpg)
Questions ?