8/12/2019 Map Reduce Intro 130424032255 Phpapp01
MapReduce Intro
The MapReduce Programming Model
Introduction and Examples
Dr. Jose María Alvarez-Rodríguez
Quality Management in Service-based Systems and Cloud Applications
FP7 RELATE-ITN
South East European Research Center
Thessaloniki, 10th of April, 2013
MapReduce in a nutshell
Features
A programming model...
1 Large-scale distributed data processing
2 Simple but restricted
3 Parallel programming
4 Extensible
Antecedents
Functional programming
1 Inspired
2 ...but not equivalent
Example in Python
Given a list of numbers between 1 and 50, print only the even numbers
print(list(filter(lambda x: x % 2 == 0, range(1, 51))))
A list of numbers (data)
A condition (even numbers)
A function filter that is applied to the list (map)
...Other examples...
Example in Python
Return the sum of the squares of a list of numbers between 1 and 50
from functools import reduce
import operator
reduce(operator.add, map(lambda x: x ** 2, range(1, 51)), 0)
reduce is equivalent to foldl in other functional languages such as Haskell
other mathematical considerations should be taken into account (the kind of operator, e.g. whether it is associative)...
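Why the kind of operator matters can be sketched in a few lines. This is only an illustration: the chunked helper and the chunk size of 10 are assumptions made for the example, standing in for the way a framework splits its input. With an associative operator such as addition, folding each chunk and then folding the partial results gives the same answer as one sequential fold. With subtraction it does not.

```python
from functools import reduce
import operator


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


numbers = list(range(1, 51))
squares = [x ** 2 for x in numbers]

# One sequential fold over the whole list.
total = reduce(operator.add, squares, 0)

# Fold every chunk on its own, then fold the partial results.
partials = [reduce(operator.add, chunk, 0) for chunk in chunked(squares, 10)]
total_from_chunks = reduce(operator.add, partials, 0)

# Subtraction is not associative, so the same strategy changes the answer.
sub_sequential = reduce(operator.sub, numbers, 0)
sub_partials = [reduce(operator.sub, chunk, 0) for chunk in chunked(numbers, 10)]
sub_chunked = reduce(operator.sub, sub_partials, 0)

print(total, total_from_chunks)
print(sub_sequential, sub_chunked)
```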
Some interesting points...
The MapReduce framework...
1 Inspired by functional programming concepts (but not equivalent)
2 Problems that can be parallelized
3 Sometimes recursive solutions
4 ...
Basic Model
MapReduce: The Programming Model and Practice, SIGMETRICS, Tutorials 2009, Google.
Map Function
Figure: Mapping creates a new output list by applying a function to
individual elements of an input list.
Module 4: MapReduce, Hadoop Tutorial, Yahoo!.
Reduce Function
Figure: Reducing a list iterates over the input values to produce an aggregate value as output.
Module 4: MapReduce, Hadoop Tutorial, Yahoo!.
MapReduce Flow
Figure: High-level MapReduce pipeline.
Module 4: MapReduce, Hadoop Tutorial, Yahoo!.
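The high-level pipeline (map, then group by key, then reduce) can be imitated in a single process. The sketch below is not how Hadoop is implemented. The names run_mapreduce, letter_mapper and count_reducer are invented for this illustration, and the job simply counts letters in two words.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key) and reduce in one process."""
    # Map phase: every record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle phase: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per key, over that key's values.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def letter_mapper(word):
    """Emit (letter, 1) for every letter of the word."""
    return [(letter, 1) for letter in word]


def count_reducer(key, values):
    """Add up the counts emitted for one letter."""
    return sum(values)


letter_counts = run_mapreduce(["map", "reduce"], letter_mapper, count_reducer)
print(letter_counts)
```

A real framework runs the map and reduce calls on different machines and moves the grouped data over the network. Only the shape of the computation is the same.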
Figure: Detailed Hadoop MapReduce data flow.
Tip
What is MapReduce?
It is a framework inspired by functional programming for tackling problems whose steps can be parallelized by applying a divide-and-conquer approach.
Thinking in MapReduce
When should I use MapReduce?
Query
  Index and Search: inverted index
  Filtering
  Classification
  Recommendations: clustering or collaborative filtering
Analytics
  Summarization and statistics
  Sorting and merging
  Frequency distribution
  SQL-based queries: group-by, having, etc.
  Generation of graphics: histograms, scatter plots.
Others
  Message passing such as Breadth-First Search or PageRank algorithms.
How Google uses MapReduce (80% of data processing)
Large-scale web search indexing
Clustering problems for Google News
Produce reports for popular queries, e.g. Google Trends
Processing of satellite imagery data
Language model processing for statistical machine translation
Large-scale machine learning problems
. . .
Comparison of MapReduce and other approaches
MapReduce: The Programming Model and Practice, SIGMETRICS, Tutorials 2009, Google.
Evaluation of MapReduce and other approaches
MapReduce: The Programming Model and Practice, SIGMETRICS, Tutorials 2009, Google.
Apache Hadoop
MapReduce definition
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Figure: Apache Hadoop Logo.
Tip
What can I do in MapReduce?
Three main functions:
1 Querying
2 Summarizing
3 Analyzing
... large datasets in off-line mode for boosting other on-line processes.
Applying MapReduce
MapReduce in Action
MapReduce Patterns
1 Summarization
2 Filtering
3 Data Organization (sort, merging, etc.)
4 Relational-based (join, selection, projection, etc.)
5 Iterative Message Passing (graph processing)
6 Others (depending on the implementation):
  Simulation of distributed systems
  Cross-correlation
  Metapatterns
  Input-output
  ...
Overview (stages)-Counting Letters
Summarization
Types
1 Numerical summarizations
2 Inverted index
3 Counting and counters
Numerical Summarization-I
Description
A general pattern for calculating aggregate statistical values over
your data.
Intent
Group records together by a key field and calculate a numerical
aggregate per group to get a top-level view of the larger data set.
Numerical Summarization-II
Applicability
To deal with numerical data or counting.
To group data by specific fields
Examples
1 Word count
2 Record count
3 Min/Max/Count
4 Average/Median/Standard deviation
5 . . .
Numerical Summarization-Pseudocode
class Mapper
  method Map(recordid id, record r)
    for all term t in record r do
      Emit(term t, count 1)
class Reducer
  method Reduce(term t, counts [c1, c2,...])
    sum = 0
    for all count c in [c1, c2,...] do
      sum = sum + c
    Emit(term t, count sum)
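The pseudocode above translates almost line by line into plain Python. This is a single-process sketch, not Hadoop code. The function names map_record and reduce_term and the two sample records are made up for the example, and a dictionary of lists plays the role of the shuffle.

```python
from collections import defaultdict


def map_record(record_id, record):
    """Emit (term, 1) for every term in the record."""
    for term in record.split():
        yield term, 1


def reduce_term(term, counts):
    """Sum all the counts emitted for one term."""
    total = 0
    for count in counts:
        total += count
    return term, total


records = {1: "to be or not to be", 2: "to map or to reduce"}

# Shuffle: group the emitted counts by term.
grouped = defaultdict(list)
for record_id, record in records.items():
    for term, count in map_record(record_id, record):
        grouped[term].append(count)

word_counts = dict(reduce_term(term, counts) for term, counts in grouped.items())
print(word_counts)
```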
Overview-Word Counter
Numerical Summarization-Word Counter
// word (a Text) and one (an IntWritable, presumably holding 1) are fields
// of the mapper class that the slide does not show.
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line);
  while (tokenizer.hasMoreTokens()) {
    // Emit (token, 1) for every token in the line.
    word.set(tokenizer.nextToken());
    context.write(word, one);
  }
}

public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  context.write(key, new IntWritable(sum));
}
Example-II
Min/Max
Given a list of tweets (username, date, text), determine the first and last time a user commented and the number of comments.
Implementation
See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
Overview - Min/Max
Min and max creation date are the same in the map phase.
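A small single-process sketch shows why min and max coincide in the map phase: each mapper output describes exactly one tweet, so its earliest and latest date are that tweet's date. The tweet tuples, the date format and the function names below are assumptions for the illustration. They are not taken from the linked repository.

```python
from collections import defaultdict
from datetime import datetime


def map_tweet(tweet):
    """Emit (user, (min_date, max_date, count)) for a single tweet."""
    user, date_text, _text = tweet
    when = datetime.strptime(date_text, "%Y-%m-%d %H:%M")
    # One tweet: the earliest and latest dates are the same value.
    return user, (when, when, 1)


def reduce_user(tuples):
    """Merge the (min, max, count) tuples of one user."""
    first = min(t[0] for t in tuples)
    last = max(t[1] for t in tuples)
    count = sum(t[2] for t in tuples)
    return first, last, count


tweets = [
    ("ana", "2013-04-10 09:00", "hello"),
    ("bob", "2013-04-10 10:30", "mapreduce"),
    ("ana", "2013-04-11 18:15", "hadoop"),
]

# Shuffle: group the mapper outputs by user.
grouped = defaultdict(list)
for tweet in tweets:
    user, value = map_tweet(tweet)
    grouped[user].append(value)

summary = {user: reduce_user(values) for user, values in grouped.items()}
print(summary)
```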
Example II-Min/Max, function Map
// outUserId and outTuple are reusable output fields of the mapper class
// (not shown on the slide). The throws clause is kept as on the slide;
// Hadoop's Mapper.map declares only IOException and InterruptedException,
// so real code would have to catch ParseException instead.
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException, ParseException {
  Map<String, String> parsed = MRDPUtils.parse(value.toString());
  String strDate = parsed.get(MRDPUtils.CREATION_DATE);
  String userId = parsed.get(MRDPUtils.USER_ID);
  if (strDate == null || userId == null) {
    return;
  }
  Date creationDate = MRDPUtils.frmt.parse(strDate);
  outTuple.setMin(creationDate);
  outTuple.setMax(creationDate);
  outTuple.setCount(1);
  outUserId.set(userId);
  context.write(outUserId, outTuple);
}
Example II-Min/Max, function Reduce
public void reduce(Text key, Iterable<MinMaxCountTuple> values, Context context)
    throws IOException, InterruptedException {
  result.setMin(null);
  result.setMax(null);
  int sum = 0;
  for (MinMaxCountTuple val : values) {
    if (result.getMin() == null
        || val.getMin().compareTo(result.getMin()) < 0) {
      result.setMin(val.getMin());
    }
    if (result.getMax() == null
        || val.getMax().compareTo(result.getMax()) > 0) {
      result.setMax(val.getMax());
    }
    sum += val.getCount();
  }
  result.setCount(sum);
  context.write(key, result);
}
Example-III
Average
Given a list of tweets (username, date, text), determine the average comment length per hour of day.
Implementation
See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
Overview - Average
Example III-Average, function Map
// outHour and outCountAverage are reusable output fields of the mapper class
// (not shown on the slide). ParseException is declared as on the slide; a
// real Mapper.map override would have to catch it instead.
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException, ParseException {
  Map<String, String> parsed = MRDPUtils.parse(value.toString());
  String strDate = parsed.get(MRDPUtils.CREATION_DATE);
  String text = parsed.get(MRDPUtils.TEXT);
  if (strDate == null || text == null) {
    return;
  }
  Date creationDate = MRDPUtils.frmt.parse(strDate);
  outHour.set(creationDate.getHours());
  outCountAverage.setCount(1);
  outCountAverage.setAverage(text.length());
  context.write(outHour, outCountAverage);
}
Example III-Average, function Reduce
public void reduce(IntWritable key, Iterable<CountAverageTuple> values, Context context)
    throws IOException, InterruptedException {
  float sum = 0;
  float count = 0;
  for (CountAverageTuple val : values) {
    sum += val.getCount() * val.getAverage();
    count += val.getCount();
  }
  result.setCount(count);
  result.setAverage(sum / count);
  context.write(key, result);
}
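The reducer weights each average by its count because an average of averages is generally wrong when the groups differ in size. A minimal sketch with made-up comment lengths illustrates the difference. The combine function is an illustrative name, not part of the original code.

```python
def combine(tuples):
    """Merge (count, average) pairs into a single (count, average) pair."""
    total = sum(count * average for count, average in tuples)
    count = sum(count for count, _ in tuples)
    return count, total / count


# Hypothetical comment lengths that all fall into the same hour of day.
lengths = [10, 20, 30, 100]

# Map phase: every comment becomes (count=1, average=its own length).
mapped = [(1, float(n)) for n in lengths]

# Two partial aggregations over unevenly sized groups.
partial_a = combine(mapped[:3])
partial_b = combine(mapped[3:])

# Weighted merge of the partial results versus a naive mean of the averages.
overall = combine([partial_a, partial_b])
naive = (partial_a[1] + partial_b[1]) / 2

print(overall, naive)
```

Carrying the count alongside the average is what lets the same merge logic be applied to partial results as well as to raw mapper output.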
Numerical Summarization-Other approaches
Relation to SQL
SELECT MIN(numcol1), MAX(numcol1), COUNT(*) FROM table GROUP BY groupcol2;
Implementation in PIG
b = GROUP a BY groupcol2;
c = FOREACH b GENERATE group, MIN(a.numcol1), MAX(a.numcol1), COUNT_STAR(a);
Filtering
Types
1 Filtering
2 Top N records
3 Bloom filtering
4 Distinct
Filtering-I
Description
It evaluates each record separately and decides, based on some condition, whether it should stay or go.
Intent
Filter out records that are not of interest and keep ones that are.
Filtering-II
Applicability
To collate data
Examples
1 Closer view of dataset
2 Data cleansing
3 Tracking a thread of events
4 Simple random sampling
5 Distributed Grep
6 Removing low-scoring data
7 Log Analysis
8 Data Querying
9 Data Validation
10 . . .
Filtering-Pseudocode
class Mapper
  method Map(recordid id, record r)
    field f = extract(r)
    if predicate(f)
      Emit(recordid id, value(r))
class Reducer
  method Reduce(recordid id, values [r1, r2,...])
    // Whatever
    Emit(recordid id, aggregate(values))
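The mapper half of the filtering pseudocode can be tried out in plain Python. In this sketch the record layout, the score threshold and the names map_filter and high_score are all assumptions for illustration. The reducer is left out because, for pure filtering, the map output is already the result.

```python
def map_filter(record_id, record, predicate):
    """Emit the record only when the predicate accepts it."""
    if predicate(record):
        yield record_id, record


def high_score(record):
    """Example predicate: keep records whose score is at least 8."""
    return record["score"] >= 8


records = {
    1: {"user": "ana", "score": 12},
    2: {"user": "bob", "score": 3},
    3: {"user": "eve", "score": 8},
}

kept = dict(
    pair
    for record_id, record in records.items()
    for pair in map_filter(record_id, record, high_score)
)
print(kept)
```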
Example-IV
Distributed Grep
Given a list of tweets (username, date, text), determine the tweets that contain a given word.
Implementation
See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
Overview - Distributed Grep
Example IV-Distributed Grep, function Map
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  Map<String, String> parsed = MRDPUtils.parse(value.toString());
  String txt = parsed.get(MRDPUtils.TEXT);
  String mapRegex = ".*\\b" + context.getConfiguration().get("mapregex") + "(.)*\\b.*";
  if (txt.matches(mapRegex)) {
    context.write(NullWritable.get(), value);
  }
}
...and the Reduce function?
In this case it is not necessary: output values are written directly to the output.
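A map-only job of this kind is easy to mimic with a generator. The sketch below approximates the Java pattern by looking for the search word at a word boundary anywhere in the line. The sample tweets and the name grep_mapper are invented for the example.

```python
import re


def grep_mapper(lines, word):
    """Yield the lines in which `word` starts at a word boundary."""
    pattern = re.compile(r"\b" + re.escape(word))
    for line in lines:
        if pattern.search(line):
            # Map-only job: a matching line goes straight to the output.
            yield line


tweets = [
    "learning hadoop today",
    "nothing to see",
    "hadoop-based pipelines",
    "shadoop is a typo",
]

matches = list(grep_mapper(tweets, "hadoop"))
print(matches)
```

Because no aggregation is needed, nothing is grouped by key and no reduce step follows.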
Example-V
Top 5
Given a list of tweets (username, date, text), determine the 5 users who wrote the longest tweets.
Implementation
See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
Overview - Top 5
Example V-Top 5, function Map
private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

// Note: the slide does not show where the retained records are emitted.
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  Map<String, String> parsed = MRDPUtils.parse(value.toString());
  if (parsed == null) {
    return;
  }
  String userId = parsed.get(MRDPUtils.USER_ID);
  String reputation = String.valueOf(parsed.get(MRDPUtils.TEXT).length());
  // Max reputation if you write longer tweets
  if (userId == null || reputation == null) {
    return;
  }
  repToRecordMap.put(Integer.parseInt(reputation), new Text(value));
  // Keep only the MAX_TOP longest tweets seen so far.
  if (repToRecordMap.size() > MAX_TOP) {
    repToRecordMap.remove(repToRecordMap.firstKey());
  }
}
Example V-Top 5, function Reduce
public void reduce(NullWritable key, Iterable<Text> values, Context context)
    throws IOException, InterruptedException {
  for (Text value : values) {
    Map<String, String> parsed = MRDPUtils.parse(value.toString());
    repToRecordMap.put(parsed.get(MRDPUtils.TEXT).length(), new Text(value));
    if (repToRecordMap.size() > MAX_TOP) {
      repToRecordMap.remove(repToRecordMap.firstKey());
    }
  }
  for (Text t : repToRecordMap.descendingMap().values()) {
    context.write(NullWritable.get(), t);
  }
}
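The two-level structure of the pattern (each mapper keeps a local top N, then one reducer takes the top N of those candidates) can be sketched with heapq. The sample tweets, the split into two inputs and the name local_top are assumptions. TOP_N is set to 2 rather than 5 so that the local pruning is visible on such a small input.

```python
import heapq

TOP_N = 2  # the slides use 5; a smaller value keeps the example short


def local_top(tweets, n=TOP_N):
    """Return the n (user, text) pairs with the longest text."""
    return heapq.nlargest(n, tweets, key=lambda tweet: len(tweet[1]))


# Two input splits, as if handled by two separate mappers.
split_a = [("ana", "a" * 10), ("bob", "b" * 3), ("eve", "e" * 7)]
split_b = [("dan", "d" * 12), ("flo", "f" * 1), ("gus", "g" * 9), ("hal", "h" * 5)]

# Map phase: each split forwards only its own top candidates.
candidates = local_top(split_a) + local_top(split_b)

# Reduce phase: a single reducer picks the overall winners.
top_users = [user for user, _text in local_top(candidates)]
print(top_users)
```

One difference from the Java version is worth noting: a TreeMap keyed by tweet length keeps a single record per length, so ties overwrite each other, whereas heapq.nlargest keeps tied entries.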
Filtering-Other approaches
Relation to SQL
SELECT * FROM table WHERE colvalue < VALUE;
Implementation in PIG
b = FILTER a BY colvalue < VALUE;
Tip
How can I use and run a MapReduce framework?
You should identify what kind of problem you are addressing and apply a design pattern to be implemented in a framework such as Apache Hadoop.
Success Stories with MapReduce
Tip
Who is using MapReduce?
Companies that deal with Big Data problems for analytics, such as:
Cloudera
Datasalt
Elasticsearch
. . .
Apache Hadoop-Related Projects
More tips
FAQ
MapReduce is a framework based on a simple programming model
...to deal with large datasets in a distributed fashion
...providing scalability, replication, fault tolerance, etc.
Apache Hadoop is not a database
New frameworks on top of Hadoop for specific tasks: querying, analysis, etc.
Other similar frameworks: Storm, Signal/Collect, etc.
. . .
Summary and Conclusions
Summary
Conclusions
What is MapReduce?
It is a framework inspired by functional programming for tackling problems whose steps can be parallelized by applying a divide-and-conquer approach.
What can I do in MapReduce?
Three main functions:
1 Querying
2 Summarizing
3 Analyzing
... large datasets in off-line mode for boosting other on-line processes.
How can I use and run a MapReduce framework?
You should identify what kind of problem you are addressing and apply a design pattern to be implemented in a framework such as Apache Hadoop.
What's next?
...
Concatenate MapReduce jobs
Optimization using combiners and tuning parameters (partition size, etc.)
Pipelining with other languages such as Python
Hadoop in Action: more examples, etc.
New trending problems (image/video processing)
Real-time processing
...
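As a taste of the combiner item in the list above, the sketch below pre-aggregates each mapper's output locally before anything would be shuffled. It is a single-process illustration with invented input splits and function names. It shows only the idea of reducing the number of intermediate pairs, not Hadoop's combiner API.

```python
from collections import Counter


def map_words(line):
    """Emit (word, 1) for every word of one input split."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Pre-aggregate one mapper's output before it is shuffled."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


splits = ["a a a b", "a b b b"]

raw = [map_words(split) for split in splits]
combined = [combine(pairs) for pairs in raw]

# Number of (key, value) pairs that would cross the network in each case.
shuffled_without = sum(len(pairs) for pairs in raw)
shuffled_with = sum(len(pairs) for pairs in combined)

# Reduce phase over the combined output: the totals are unchanged.
totals = Counter()
for pairs in combined:
    for key, value in pairs:
        totals[key] += value

print(shuffled_without, shuffled_with, dict(totals))
```

This works for counting because addition is associative and commutative. An operation without those properties cannot be pre-aggregated this way.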
References
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, Jan. 2008.
J. R. Owens, J. Lentz, and B. Femiano. Hadoop Real-World Solutions Cookbook. Packt Publishing Ltd, 2013.
C. Lam. Hadoop in Action. Manning Publications Co., Greenwich, CT, USA, 1st edition, 2010.
J. Lin and C. Dyer. Data-intensive text processing with MapReduce. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts, NAACL-Tutorials '09, pages 1-2, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
D. Miner and A. Shook. MapReduce Design Patterns. O'Reilly and Associates Inc, 2012.
S. Perera and T. Gunarathne. Hadoop MapReduce Cookbook. Packt Publishing Ltd, 2013.
T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 1st edition, 2009.
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.