Big Data: Pig Latin
P.J. McBrien
Imperial College London
P.J. McBrien (Imperial College London) Big Data: Pig Latin 1 / 44
Introduction
Scale Up
As the amount of data increases (1GB, 1TB, 1PB), buy a larger computer to hold that data.
Scale Out
As the amount of data increases (1GB, 1TB, 1PB), buy more commodity computers to spread the data over.
CAP Theorem
No distributed system may maintain all three of:
Consistency: all nodes see the same version of data
Availability: the system always responds within fixed upper limits of time
Partition Tolerance: the system is always available, even when messages are lost or network failures occur
A system can therefore maintain at most two of the three:
CA, e.g. Centralised Database
CP, e.g. Distributed RDBMS
AP, e.g. DNS
What is a Big Data System?
[Figure: nodes H1/S1, H2/S2, ..., Hn−1/Sn−1, Hn/Sn connected by a LAN]
A big data system is able to handle:
more data than fits on a commodity computer (TBs or PBs of data)
data spread over hundreds or thousands of servers
failures of nodes without loss of data
Consequence of CAP Theorem
availability prioritised over consistency
Data Models
Key-Value
Key-Value pairs
Schema-less
Very limited querying capabilities: useful for implementing a cache
e.g. Memcached, Redis
Document
Document (semi-structured) data model (e.g. JSON)
Schema-less
Supports queries searching fields within a document
Use MapReduce for OLAP
e.g. CouchDB, MongoDB
Wide Column
Table data model, with easy addition of new columns
Columns are put into families (which allows vertical fragmentation on families)
Schema-less
Supports queries searching field values
Use MapReduce for OLAP
e.g. BigTable, HBase, Cassandra
Relational
Relational data model
Schema based
Supports queries searching fields and performing joins
ACID properties of transactions
e.g. MySQL Cluster, VoltDB
Graph
Graph model: nodes and edges (e.g. RDF)
Schema-less
e.g. Neo4j, Stardog
MapReduce
[Figure: data nodes D1-D4 are loaded by map nodes M1-M5, whose output is shuffled to reduce nodes R1-R3]
MapReduce: Map Phase of Word Count
M1 input:
The First Lord of the Admiralty in his speech the other night went even farther. He said, ‘We are always reviewing the position’. Everything, he assured us is entirely fluid. I am sure that that is true. Anyone can see what the position is. The Government
M1 output:
(the,1) (first,1) (lord,1) (of,1) (the,1) (admiralty,1) (in,1) (his,1) (speech,1) (the,1) (other,1) (night,1) (went,1) (even,1) (farther,1) (he,1) (said,1) (we,1) (are,1) (always,1) (reviewing,1) (the,1) (position,1) (everything,1) (he,1) (assured,1) (us,1) (is,1) (entirely,1) (fluid,1) (i,1) (am,1) (sure,1) (that,1) (that,1) (is,1) (true,1) (anyone,1) (can,1) (see,1) (what,1) (the,1) (position,1) (is,1) (the,1) (government,1)
M2 input:
simply cannot make up their minds, or they cannot get the Prime Minister to make up his mind. So they go on in strange paradox, decided only to be undecided, resolved to be irresolute, adamant for drift, solid for fluidity, all-powerful to be impotent.
M2 output:
(simply,1) (cannot,1) (make,1) (up,1) (their,1) (minds,1) (or,1) (they,1) (cannot,1) (get,1) (the,1) (prime,1) (minister,1) (to,1) (make,1) (up,1) (his,1) (mind,1) (so,1) (they,1) (go,1) (on,1) (in,1) (strange,1) (paradox,1) (decided,1) (only,1) (to,1) (be,1) (undecided,1) (resolved,1) (to,1) (be,1) (irresolute,1) (adamant,1) (for,1) (drift,1) (solid,1) (for,1) (fluidity,1) (all-powerful,1) (to,1) (be,1) (impotent,1)
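The Map phase can be sketched in a few lines of Python. This is only an illustration, not Hadoop code: the function name map_words and the tokenisation rule (lower-case runs of letters, optionally joined by hyphens) are assumptions chosen to reproduce the pairs shown on the slide.

```python
import re

def map_words(text):
    """Emit a (word, 1) pair for every word in the text, lower-cased."""
    # Assumed tokenisation: runs of letters, optionally joined by hyphens,
    # so that "all-powerful" stays a single word as on the slide.
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return [(word, 1) for word in words]

pairs = map_words("He said, 'We are always reviewing the position'.")
print(pairs)
```

Each map node would run this independently over its own fragment of the text.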
MapReduce: Shuffle Phase of Word Count
M1
(the,1) (first,1) (lord,1) (of,1)(the,1) (admiralty,1) (in,1) (his,1)(speech,1) (the,1) (other,1)(night,1) (went,1) (even,1)(farther,1) (he,1) (said,1) (we,1)(are,1) (always,1) (reviewing,1)(the,1) (position,1) (everything,1)(he,1) (assured,1) (us,1) (is,1)(entirely,1) (fluid,1) (i,1) (am,1)(sure,1) (that,1) (that,1) (is,1)(true,1) (anyone,1) (can,1) (see,1)(what,1) (the,1) (position,1) (is,1)(the,1) (government,1)
R1
(first,1) (admiralty,1) (in,1) (his,1)(even,1) (farther,1) (he,1) (are,1)(always,1) (everything,1) (he,1)(assured,1) (is,1) (entirely,1)(fluid,1) (i,1) (am,1) (is,1)(anyone,1) (can,1) (is,1)(government,1) (cannot,1)(cannot,1) (get,1) (his,1) (go,1)(in,1) (decided,1) (be,1) (be,1)(irresolute,1) (adamant,1) (for,1)(drift,1) (for,1) (fluidity,1)(all-powerful,1) (be,1) (impotent,1)
M2
(simply,1) (cannot,1) (make,1)(up,1) (their,1) (minds,1) (or,1)(they,1) (cannot,1) (get,1) (the,1)(prime,1) (minister,1) (to,1)(make,1) (up,1) (his,1) (mind,1)(so,1) (they,1) (go,1) (on,1) (in,1)(strange,1) (paradox,1) (decided,1)(only,1) (to,1) (be,1) (undecided,1)(resolved,1) (to,1) (be,1)(irresolute,1) (adamant,1) (for,1)(drift,1) (solid,1) (for,1) (fluidity,1)(all-powerful,1) (to,1) (be,1)(impotent,1)
R2
(the,1) (lord,1) (of,1) (the,1)(speech,1) (the,1) (other,1)(night,1) (went,1) (said,1) (we,1)(reviewing,1) (the,1) (position,1)(us,1) (sure,1) (that,1) (that,1)(true,1) (see,1) (what,1) (the,1)(position,1) (the,1) (simply,1)(make,1) (up,1) (their,1) (minds,1)(or,1) (they,1) (the,1) (prime,1)(minister,1) (to,1) (make,1) (up,1)(mind,1) (so,1) (they,1) (on,1)(strange,1) (paradox,1) (only,1)(to,1) (undecided,1) (resolved,1)(to,1) (solid,1) (to,1)
MapReduce: Reduce Phase of Word Count
R1
(first,1) (admiralty,1) (in,1) (his,1)(even,1) (farther,1) (he,1) (are,1)(always,1) (everything,1) (he,1)(assured,1) (is,1) (entirely,1)(fluid,1) (i,1) (am,1) (is,1)(anyone,1) (can,1) (is,1)(government,1) (cannot,1)(cannot,1) (get,1) (his,1) (go,1)(in,1) (decided,1) (be,1) (be,1)(irresolute,1) (adamant,1) (for,1)(drift,1) (for,1) (fluidity,1)(all-powerful,1) (be,1) (impotent,1)
(adamant,1) (admiralty,1)(all-powerful,1) (always,1) (am,1)(anyone,1) (are,1) (assured,1) (be,3)(can,1) (cannot,2) (decided,1)(drift,1) (entirely,1) (even,1)(everything,1) (farther,1) (first,1)(fluid,1) (fluidity,1) (for,2) (get,1)(go,1) (government,1) (he,2) (his,2)(i,1) (impotent,1) (in,2)(irresolute,1) (is,3)
R2
(the,1) (lord,1) (of,1) (the,1)(speech,1) (the,1) (other,1)(night,1) (went,1) (said,1) (we,1)(reviewing,1) (the,1) (position,1)(us,1) (sure,1) (that,1) (that,1)(true,1) (see,1) (what,1) (the,1)(position,1) (the,1) (simply,1)(make,1) (up,1) (their,1) (minds,1)(or,1) (they,1) (the,1) (prime,1)(minister,1) (to,1) (make,1) (up,1)(mind,1) (so,1) (they,1) (on,1)(strange,1) (paradox,1) (only,1)(to,1) (undecided,1) (resolved,1)(to,1) (solid,1) (to,1)
(lord,1) (make,2) (mind,1)(minds,1) (minister,1) (night,1)(of,1) (on,1) (only,1) (or,1) (other,1)(paradox,1) (position,2) (prime,1)(resolved,1) (reviewing,1) (said,1)(see,1) (simply,1) (so,1) (solid,1)(speech,1) (strange,1) (sure,1)(that,2) (the,7) (their,1) (they,2)(to,4) (true,1) (undecided,1) (up,2)(us,1) (we,1) (went,1) (what,1)
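The Shuffle and Reduce phases can be sketched in Python as follows. The partitioning rule (words starting with a to i go to R1, all others to R2) is an assumption read off the example output, and the function names are illustrative only.

```python
from collections import defaultdict

def shuffle(map_outputs, partition):
    """Route every (word, count) pair to the reduce node chosen by partition(word)."""
    reducers = defaultdict(list)
    for pairs in map_outputs:
        for word, count in pairs:
            reducers[partition(word)].append((word, count))
    return reducers

def reduce_counts(pairs):
    """Sum the counts per word and return the totals sorted by word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

def by_first_letter(word):
    # Assumed rule: words starting a-i go to R1, the rest to R2.
    return "R1" if word[0] <= "i" else "R2"

m1 = [("the", 1), ("first", 1), ("lord", 1), ("the", 1)]
m2 = [("the", 1), ("be", 1), ("be", 1)]
routed = shuffle([m1, m2], by_first_letter)
result = {node: reduce_counts(pairs) for node, pairs in routed.items()}
print(result)
```

Because all pairs for a given word reach the same reduce node, each node can total its words without consulting the others.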
MapReduce: Combine Phase on Map Nodes
Combine
Often (and in particular for aggregate operators on grouped data), the Reduce process may be partially calculated on the Map nodes. Such a partial Reduce process is called a Combine operation.
Operation   Combine at Mi     Reduce
Sum(B)      Ci = Sum(Bi)      Sum([C1, ..., Cn])
Count(B)    Ci = Count(Bi)    Sum([C1, ..., Cn])
Min(B)      Ci = Min(Bi)      Min([C1, ..., Cn])
Applying Combine to the WordCount problem
Map phase identifies words from text
Combine phase counts the number of times each word appears on each Map node
Reduce phase sums per word the output of all Combine phases
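A minimal Python sketch of Combine followed by Reduce for word count, corresponding to the Count(B) row of the table (local counts, then a sum of those counts). The function names and the small inputs are illustrative assumptions.

```python
from collections import Counter

def combine(pairs):
    """Partial reduce on one map node: Ci = count of each word in Bi."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def reduce_sum(combined_outputs):
    """Final reduce: Sum([C1, ..., Cn]) per word."""
    total = Counter()
    for pairs in combined_outputs:
        for word, count in pairs:
            total[word] += count
    return dict(total)

m1 = [("the", 1)] * 6 + [("is", 1)] * 3
m2 = [("the", 1)] + [("to", 1)] * 4
c1, c2 = combine(m1), combine(m2)
totals = reduce_sum([c1, c2])
print(c1, c2, totals)
```

With Combine, M1 sends two pairs instead of nine across the network, which is the point of pushing the partial aggregate to the map nodes.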
MapReduce: Combine Phase of Word Count
M1
(the,1) (first,1) (lord,1) (of,1)(the,1) (admiralty,1) (in,1) (his,1)(speech,1) (the,1) (other,1)(night,1) (went,1) (even,1)(farther,1) (he,1) (said,1) (we,1)(are,1) (always,1) (reviewing,1)(the,1) (position,1) (everything,1)(he,1) (assured,1) (us,1) (is,1)(entirely,1) (fluid,1) (i,1) (am,1)(sure,1) (that,1) (that,1) (is,1)(true,1) (anyone,1) (can,1) (see,1)(what,1) (the,1) (position,1) (is,1)(the,1) (government,1)
(i,1) (am,1) (he,2) (in,1) (is,3) (of,1)(us,1) (we,1) (are,1) (can,1) (his,1)(see,1) (the,6) (even,1) (lord,1)(said,1) (sure,1) (that,2) (true,1)(went,1) (what,1) (first,1) (fluid,1)(night,1) (other,1) (always,1)(anyone,1) (speech,1) (assured,1)(farther,1) (entirely,1) (position,2)(admiralty,1) (reviewing,1)(everything,1) (government,1)
M2
(simply,1) (cannot,1) (make,1)(up,1) (their,1) (minds,1) (or,1)(they,1) (cannot,1) (get,1) (the,1)(prime,1) (minister,1) (to,1)(make,1) (up,1) (his,1) (mind,1)(so,1) (they,1) (go,1) (on,1) (in,1)(strange,1) (paradox,1) (decided,1)(only,1) (to,1) (be,1) (undecided,1)(resolved,1) (to,1) (be,1)(irresolute,1) (adamant,1) (for,1)(drift,1) (solid,1) (for,1) (fluidity,1)(all-powerful,1) (to,1) (be,1)(impotent,1)
(be,3) (go,1) (in,1) (on,1) (or,1) (so,1) (to,4) (up,2) (for,2) (get,1) (his,1) (the,1) (make,2) (mind,1) (only,1) (they,2) (drift,1) (minds,1) (prime,1) (solid,1) (their,1) (cannot,2) (simply,1) (adamant,1) (decided,1) (paradox,1) (strange,1) (fluidity,1) (impotent,1) (minister,1) (resolved,1) (undecided,1) (irresolute,1) (all-powerful,1)
MapReduce: Reduce Phase of Word Count after Combine
R1
(admiralty,1) (always,1) (am,1)(anyone,1) (are,1) (assured,1)(can,1) (entirely,1) (even,1)(everything,1) (farther,1) (first,1)(fluid,1) (government,1) (he,2)(his,1) (i,1) (in,1) (is,3) (adamant,1)(all-powerful,1) (be,3) (cannot,2)(decided,1) (drift,1) (fluidity,1)(for,2) (get,1) (go,1) (his,1)(impotent,1) (in,1) (irresolute,1)
(adamant,1) (admiralty,1)(all-powerful,1) (always,1) (am,1)(anyone,1) (are,1) (assured,1) (be,3)(can,1) (cannot,2) (decided,1)(drift,1) (entirely,1) (even,1)(everything,1) (farther,1) (first,1)(fluid,1) (fluidity,1) (for,2) (get,1)(go,1) (government,1) (he,2) (his,2)(i,1) (impotent,1) (in,2)(irresolute,1) (is,3)
R2
(lord,1) (night,1) (of,1) (other,1)(position,2) (reviewing,1) (said,1)(see,1) (speech,1) (sure,1) (that,2)(the,6) (true,1) (us,1) (we,1)(went,1) (what,1) (make,2) (mind,1)(minds,1) (minister,1) (on,1)(only,1) (or,1) (paradox,1) (prime,1)(resolved,1) (simply,1) (so,1)(solid,1) (strange,1) (the,1) (their,1)(they,2) (to,4) (undecided,1) (up,2)
(lord,1) (make,2) (mind,1)(minds,1) (minister,1) (night,1)(of,1) (on,1) (only,1) (or,1) (other,1)(paradox,1) (position,2) (prime,1)(resolved,1) (reviewing,1) (said,1)(see,1) (simply,1) (so,1) (solid,1)(speech,1) (strange,1) (sure,1)(that,2) (the,7) (their,1) (they,2)(to,4) (true,1) (undecided,1) (up,2)(us,1) (we,1) (went,1) (what,1)
MapReduce Implementations: Hadoop Family
HDFS
Hadoop
HBase
Java Hive Pig
Pig Latin
Pig: Accessing Data
LOAD
The LOAD operator makes a data source available as a relation.
account.tsv
100[tab]current[tab]McBrien, P.[tab][tab]67
101[tab]deposit[tab]McBrien, P.[tab]5.25[tab]67
103[tab]current[tab]Boyd, M.[tab][tab]34
107[tab]current[tab]Poulovassilis, A.[tab][tab]56
119[tab]deposit[tab]Poulovassilis, A.[tab]5.50[tab]56
125[tab]current[tab]Bailey, J.[tab][tab]56
Reading a TSV file
account =
    LOAD '/vol/automed/data/bank_branch/account.tsv'
    AS (no:int, type:chararray, cname:chararray, rate:float, sortcode:int);
Running Pig Scripts
copy_account.pig
account =
    LOAD '/vol/automed/data/bank_branch/account.tsv'
    AS (no:int, type:chararray, cname:chararray, rate:float, sortcode:int);
STORE account INTO 'account_copy' USING PigStorage(',');
Non-interactive
pig -x local copy_account.pig
Interactive
pig -x local
grunt> account =
    LOAD '/vol/automed/data/bank_branch/account.tsv'
    AS (no:int, type:chararray, cname:chararray, rate:float, sortcode:int);
grunt> STORE account INTO 'account_copy' USING PigStorage(',');
Interactive: inspecting schemas and viewing results
pig -x local
grunt> account =
    LOAD '/vol/automed/data/bank_branch/account.tsv'
    AS (no:int, type:chararray, cname:chararray, rate:float, sortcode:int);
grunt> DESCRIBE account;
grunt> DUMP account;
Pig: Implementation of the RA
Operators covered: Project π, Select σ, Product ×, Join ⋊⋉, Union ∪, Difference −
account
no   type       cname                rate?  sortcode
100  'current'  'McBrien, P.'        NULL   67
101  'deposit'  'McBrien, P.'        5.25   67
103  'current'  'Boyd, M.'           NULL   34
107  'current'  'Poulovassilis, A.'  NULL   56
119  'deposit'  'Poulovassilis, A.'  5.50   56
125  'current'  'Bailey, J.'         NULL   56
Project π
FOREACH 〈alias〉 GENERATE 〈colname〉, ...
Projects certain column names from an alias
πsortcode account
account_sortcode_bag =
    FOREACH account
    GENERATE sortcode;
account_sortcode =
    DISTINCT account_sortcode_bag;
Select σ
FILTER 〈alias〉 BY 〈predicate〉
Only passes those tuples in 〈alias〉 that match the 〈predicate〉
σrate>0 account
account_with_rate =
    FILTER account
    BY rate > 0.0;
Product ×
CROSS 〈alias〉, 〈alias〉
Produces the Cartesian product of two relations
branch × σrate>0 account
branch_account_with_rate =
    CROSS branch, account_with_rate;
Join ⋊⋉
JOIN 〈alias〉 BY 〈colname〉, 〈alias〉 BY 〈colname〉
Performs an equi-join between two relations on the specified columns.
branch ⋊⋉ σrate>0 account
branch_with_interest_account =
    JOIN branch BY sortcode,
         account_with_rate BY sortcode;
Union ∪
UNION 〈alias〉, 〈alias〉
Performs a bag-based union between two relations
πsortcode branch ∪ πno account
branch_sortcode =
    FOREACH branch
    GENERATE sortcode;
account_no =
    FOREACH account
    GENERATE no;
all_ids_bag =
    UNION branch_sortcode, account_no;
all_ids =
    DISTINCT all_ids_bag;
Difference −
No direct implementation. The same result can be achieved by performing a LEFT join, and then keeping only those rows where the right-hand relation's columns are null.
πno account − πno movement
account_and_movement =
    JOIN account BY no LEFT,
         movement BY no;
account_without_movement =
    FILTER account_and_movement
    BY movement::no IS NULL;
account_no_without_movement =
    FOREACH account_without_movement
    GENERATE account::no AS no;
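The left-join-then-filter pattern for Difference can be checked with a small Python stand-in. This is a sketch, not Pig: left_join is a hypothetical helper, rows are plain dictionaries, and only the no column of each relation is modelled.

```python
def left_join(left, right, key):
    """Left outer join of two lists of dicts on `key`; unmatched left rows pair with None."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        # A left row with no partner is kept once, with None as its right side.
        matches = index.get(row[key], [None])
        joined.extend((row, match) for match in matches)
    return joined

account = [{"no": n} for n in (100, 101, 103, 107, 119, 125)]
movement = [{"no": n} for n in (100, 101, 100, 107, 103, 100, 107, 101, 119)]

joined = left_join(account, movement, "no")
# Keep only the rows whose right-hand side is null, then project the account number.
without_movement = [a["no"] for a, m in joined if m is None]
print(without_movement)  # [125]
```

Account 125 is the only account number with no movement, so it is the only value in the difference.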
Quiz 1: Understanding Pig Scripts (1)
branch
sortcode  bname        cash
56        'Wimbledon'  94340.45
34        'Goodge St'  8900.67
67        'Strand'     34005.00
account
no   type       cname                rate?  sortcode
100  'current'  'McBrien, P.'        NULL   67
101  'deposit'  'McBrien, P.'        5.25   67
103  'current'  'Boyd, M.'           NULL   34
107  'current'  'Poulovassilis, A.'  NULL   56
119  'deposit'  'Poulovassilis, A.'  5.50   56
125  'current'  'Bailey, J.'         NULL   56
a = FILTER account BY type == 'current';
ap = FOREACH a GENERATE no, sortcode;
What is the value of ap in the Pig Script?
A: ap(no, sortcode) = (100, 67), (103, 34), (107, 56), (125, 56)
B: ap(no, sortcode) = (100, 67), (103, 34), (107, 56)
C: ap(no, sortcode) = (100, 67), (107, 56)
D: ap(sortcode) = (67), (34), (56), (56)
Quiz 2: Understanding Pig Scripts (2)
The branch and account relations are as in Quiz 1.
a = FILTER branch BY cash < 50000;
b = FILTER account BY type == 'deposit';
ab = JOIN a BY sortcode, b BY sortcode;
abp = FOREACH ab GENERATE a::sortcode AS sortcode;
What is the value of abp in the Pig Script?
A: abp(sortcode) = (56), (67)
B: abp(sortcode) = (56)
C: abp(sortcode) = (67)
D: abp(sortcode) = (34)
Quiz 3: RA and Pig Equivalence
a = FILTER branch BY cash < 50000;
b = FILTER account BY type == 'deposit';
ab = JOIN a BY sortcode, b BY sortcode;
abp = FOREACH ab GENERATE a::sortcode AS sortcode;
abpd = DISTINCT abp;
Which RA expression is equivalent to abpd in the Pig Script?
A: πsortcode(σcash<50000 branch ∪ σtype=‘deposit’ account)
B: πsortcode(σcash<50000 branch ∩ σtype=‘deposit’ account)
C: πsortcode σcash<50000 branch ∪ πsortcode σtype=‘deposit’ account
D: πsortcode σcash<50000 branch ∩ πsortcode σtype=‘deposit’ account
Worksheet: Translating RA to Pig
branch
sortcode  bname        cash
56        'Wimbledon'  94340.45
34        'Goodge St'  8900.67
67        'Strand'     34005.00
movement
mid   no   amount    tdate
1000  100  2300.00   5/1/1999
1001  101  4000.00   5/1/1999
1002  100  -223.45   8/1/1999
1004  107  -100.00   11/1/1999
1005  103  145.50    12/1/1999
1006  100  10.23     15/1/1999
1007  107  345.56    15/1/1999
1008  101  1230.00   15/1/1999
1009  119  5600.00   18/1/1999
account
no   type       cname                rate?  sortcode
100  'current'  'McBrien, P.'        NULL   67
101  'deposit'  'McBrien, P.'        5.25   67
103  'current'  'Boyd, M.'           NULL   34
107  'current'  'Poulovassilis, A.'  NULL   56
119  'deposit'  'Poulovassilis, A.'  5.50   56
125  'current'  'Bailey, J.'         NULL   56
key branch(sortcode)
key branch(bname)
key movement(mid)
key account(no)
movement(no) fk⇒ account(no)
account(sortcode) fk⇒ branch(sortcode)
Translate into Pig:
1 πno movement
2 πcname,mid,amount σamount<0.0(account ⋊⋉ movement)
3 πsortcode branch − πsortcode σtype=‘deposit’ account
Worksheet: Translating RA to Pig (1)
πno movement
movement_no_bag =
    FOREACH movement
    GENERATE no;
movement_no =
    DISTINCT movement_no_bag;
Worksheet: Translating RA to Pig (2)
πcname,mid,amount σamount<0.0(account ⋊⋉ movement)
withdrawal =
    FILTER movement
    BY amount < 0;
account_with_withdrawal =
    JOIN account BY no,
         withdrawal BY no;
account_and_withdrawal_amount =
    FOREACH account_with_withdrawal
    GENERATE cname, mid, amount;
Worksheet: Translating RA to Pig (3)
πsortcode branch − πsortcode σtype=‘deposit’ account
deposit =
    FILTER account
    BY type == 'deposit';
branch_account =
    JOIN branch BY sortcode LEFT,
         deposit BY sortcode;
branches_without_deposit =
    FILTER branch_account
    BY no IS NULL;
sortcodes_without_deposit =
    FOREACH branches_without_deposit
    GENERATE branch::sortcode AS sortcode;
Relations as attributes: GROUP and FLATTEN
movement =
    LOAD '/vol/automed/data/bank_branch/movement.tsv'
    AS (mid:int, no:int, amount:double, tdate:bytearray);
movement
mid   no   amount    tdate
1000  100  2300.00   5/1/1999
1001  101  4000.00   5/1/1999
1002  100  -223.45   8/1/1999
1004  107  -100.00   11/1/1999
1005  103  145.50    12/1/1999
1006  100  10.23     15/1/1999
1007  107  345.56    15/1/1999
1008  101  1230.00   15/1/1999
1009  119  5600.00   18/1/1999
GROUP collects the tuples that share a value into a bag:
account_movements =
    GROUP movement
    BY no;
account_movements
group  movement
100    {〈1000,100,2300.0,1999-01-05〉, 〈1002,100,-223.45,1999-01-08〉, 〈1006,100,10.23,1999-01-15〉}
101    {〈1001,101,4000.0,1999-01-05〉, 〈1008,101,1230.0,1999-01-15〉}
103    {〈1005,103,145.5,1999-01-12〉}
107    {〈1004,107,-100.0,1999-01-11〉, 〈1007,107,345.56,1999-01-15〉}
119    {〈1009,119,5600.0,1999-01-18〉}
FLATTEN unnests the bags again:
movement_copy =
    FOREACH account_movements
    GENERATE FLATTEN(movement);
movement_copy
mid   no   amount    tdate
1000  100  2300.00   5/1/1999
1001  101  4000.00   5/1/1999
1002  100  -223.45   8/1/1999
1004  107  -100.00   11/1/1999
1005  103  145.50    12/1/1999
1006  100  10.23     15/1/1999
1007  107  345.56    15/1/1999
1008  101  1230.00   15/1/1999
1009  119  5600.00   18/1/1999
An aggregate can be applied to each bag instead:
account_balance =
    FOREACH account_movements
    GENERATE group AS no,
             SUM(movement.amount) AS balance;
account_balance
no   balance
100  2086.78
101  5230.00
103  145.50
107  245.56
119  5600.00
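The balances can be cross-checked with a short Python sketch that mimics GROUP followed by SUM. The dictionary-of-lists representation of the grouped relation is an assumption made for illustration; tdate is omitted because it plays no part in the sum.

```python
from collections import defaultdict

# (mid, no, amount) tuples from the movement relation.
movement = [
    (1000, 100, 2300.00), (1001, 101, 4000.00), (1002, 100, -223.45),
    (1004, 107, -100.00), (1005, 103, 145.50), (1006, 100, 10.23),
    (1007, 107, 345.56), (1008, 101, 1230.00), (1009, 119, 5600.00),
]

# GROUP movement BY no: one bag of tuples per account number.
account_movements = defaultdict(list)
for mid, no, amount in movement:
    account_movements[no].append((mid, no, amount))

# FOREACH account_movements GENERATE group AS no, SUM(movement.amount) AS balance
account_balance = {
    no: round(sum(amount for _, _, amount in bag), 2)
    for no, bag in account_movements.items()
}
print(account_balance)
```

Rounding to two decimal places is only there to hide floating-point noise in the printed totals.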
Aggregate Operators in Pig
Pig Operators over Bags of Data
Function                  Result
long COUNT(bag)           Returns the number of non-null values in the bag.
long COUNT_STAR(bag)      Returns the number of values in the bag (including any null values).
double AVG(bag)           Returns the average of values in the bag.
double MAX(bag)           Returns the maximum value in the bag.
double MIN(bag)           Returns the minimum value in the bag.
double SUM(bag)           Returns the sum of values in the bag.
bag DIFF(bag a, bag b)    Returns those tuples that appear in one of a and b but not in the other.
To achieve the equivalent of SQL’s GROUP BY and use of aggregate operators:
Use GROUP to build a bag of tuples for each value in the group
Apply a Pig aggregate operator to the bag
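The difference between COUNT and COUNT_STAR only shows up when a bag contains nulls. A minimal Python sketch, using None for null and the rate column of account as the bag; the functions count and count_star are illustrative stand-ins, not Pig's implementation.

```python
def count(bag):
    """Stand-in for COUNT: number of non-null values in the bag."""
    return sum(1 for value in bag if value is not None)

def count_star(bag):
    """Stand-in for COUNT_STAR: number of values in the bag, nulls included."""
    return len(bag)

# The rate column of the account relation, with NULL written as None.
rates = [None, 5.25, None, None, 5.50, None]

print(count(rates), count_star(rates))
```

This distinction matters after a LEFT join, where unmatched rows carry nulls in the columns of the right-hand relation.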
Quiz 4: Understanding Pig Scripts (3)
The account and movement relations are as in Worksheet: Translating RA to Pig.
ab = JOIN account BY no LEFT, movement BY no;
abg = GROUP ab BY account::no;
abr = FOREACH abg GENERATE group, COUNT(ab.movement::no) AS no_mv;
What is the value of abr in the Pig Script?
A: abr(group, no_mv) = (100, 1), (101, 1), (103, 1), (107, 1), (119, 1), (125, 0)
B: abr(group, no_mv) = (100, 3), (101, 2), (103, 1), (107, 2), (119, 1), (125, 1)
C: abr(group, no_mv) = (100, 3), (101, 2), (103, 1), (107, 2), (119, 1), (125, 0)
D: abr(group, no_mv) = (100, 3), (101, 2), (103, 1), (107, 2), (119, 1)
Optimisation of Scripts: Project Early
movement =
    LOAD '/vol/automed/data/bank_branch/movement.tsv'
    AS (mid:int, no:int, amount:double, tdate:bytearray);
Project only the columns that are needed before grouping:
movement_data =
    FOREACH movement
    GENERATE no, amount;
account_movements =
    GROUP movement_data
    BY no;
account_movements
group  movement_data
100    {〈100,2300.0〉, 〈100,-223.45〉, 〈100,10.23〉}
101    {〈101,4000.0〉, 〈101,1230.0〉}
103    {〈103,145.5〉}
107    {〈107,-100.0〉, 〈107,345.56〉}
119    {〈119,5600.0〉}
movement_project =
    FOREACH account_movements
    GENERATE FLATTEN(movement_data);
movement_project
no   amount
100  2300.00
101  4000.00
100  -223.45
107  -100.00
103  145.50
100  10.23
107  345.56
101  1230.00
119  5600.00
account_balance =
    FOREACH account_movements
    GENERATE group AS no,
             SUM(movement_data.amount) AS balance;
account_balance
no   balance
100  2086.78
101  5230.00
103  145.50
107  245.56
119  5600.00
Nested Statements
SQL Query to find total of credits and of debits
SELECT account.no,
       COUNT(movement.mid) AS no_trans,
       SUM(CASE WHEN amount > 0.0 THEN amount ELSE 0.0 END) AS credit,
       SUM(CASE WHEN amount < 0.0 THEN amount ELSE 0.0 END) AS debit
FROM account LEFT JOIN movement ON account.no = movement.no
GROUP BY account.no
Pig Script to find total of credits and of debits
account_and_movement =
    JOIN account BY no LEFT,
         movement BY no;
account_detail =
    GROUP account_and_movement BY account::no;
account_credits_and_debits =
    FOREACH account_detail {
        credit =
            FILTER account_and_movement
            BY amount > 0.0;
        debit =
            FILTER account_and_movement
            BY amount < 0.0;
        GENERATE group AS no,
                 COUNT(account_and_movement.mid) AS no_trans,
                 SUM(credit.amount) AS credit,
                 SUM(debit.amount) AS debit;
    }
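The SQL query can be tried out against the example data with Python's built-in sqlite3 module. This is a sketch under assumptions: the tables are cut down to the columns the query uses, and an ORDER BY has been added purely to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (no INTEGER PRIMARY KEY);
CREATE TABLE movement (mid INTEGER PRIMARY KEY, no INTEGER, amount REAL);
INSERT INTO account VALUES (100),(101),(103),(107),(119),(125);
INSERT INTO movement VALUES
  (1000,100,2300.00),(1001,101,4000.00),(1002,100,-223.45),
  (1004,107,-100.00),(1005,103,145.50),(1006,100,10.23),
  (1007,107,345.56),(1008,101,1230.00),(1009,119,5600.00);
""")

rows = conn.execute("""
SELECT account.no,
       COUNT(movement.mid) AS no_trans,
       SUM(CASE WHEN amount > 0.0 THEN amount ELSE 0.0 END) AS credit,
       SUM(CASE WHEN amount < 0.0 THEN amount ELSE 0.0 END) AS debit
FROM account LEFT JOIN movement ON account.no = movement.no
GROUP BY account.no
ORDER BY account.no
""").fetchall()

for row in rows:
    print(row)
```

Account 125 has no movements, so the LEFT JOIN keeps it with no_trans equal to 0 and both totals equal to 0.0.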
Worksheet: Translating SQL to Pig
The worksheet uses the branch, movement and account relations, with the same keys and foreign keys as in Worksheet: Translating RA to Pig.
Worksheet: Translating SQL to Pig (1)
SELECT branch.bname,
       account.no
FROM branch
     JOIN account ON branch.sortcode = account.sortcode
     JOIN movement ON account.no = movement.no
WHERE movement.amount < 0
withdrawal =
    FILTER movement
    BY amount < 0;
account_with_withdrawal =
    JOIN account BY no,
         withdrawal BY no;
branch_with_withdrawal =
    JOIN account_with_withdrawal BY sortcode,
         branch BY sortcode;
branch_with_withdrawal_no =
    FOREACH branch_with_withdrawal
    GENERATE bname, account::no;
Worksheet: Translating SQL to Pig (2)
SELECT DISTINCT branch.bname,
       account.no
FROM branch
     JOIN account ON branch.sortcode = account.sortcode
     JOIN movement ON account.no = movement.no
WHERE movement.amount < 0
withdrawal =
    FILTER movement
    BY amount < 0;
withdrawal_account_bag =
    FOREACH withdrawal
    GENERATE no;
withdrawal_account =
    DISTINCT withdrawal_account_bag;
account_with_withdrawal =
    JOIN account BY no,
         withdrawal_account BY no;
branch_with_withdrawal =
    JOIN account_with_withdrawal BY sortcode,
         branch BY sortcode;
branch_with_withdrawal_no =
    FOREACH branch_with_withdrawal
    GENERATE bname, account::no;
Worksheet: Translating SQL to Pig (3)
SELECT account.cname, SUM(movement.amount) AS balance
FROM account
     LEFT JOIN movement ON account.no=movement.no
GROUP BY account.cname

account_movement = JOIN account BY no LEFT, movement BY no;
customer_details = GROUP account_movement BY account::cname;
customer_balance = FOREACH customer_details GENERATE group AS cname, SUM(account_movement.movement::amount) AS balance;
Worksheet: Translating SQL to Pig (3) Optimised
SELECT account.cname, SUM(movement.amount) AS balance
FROM account
     LEFT JOIN movement ON account.no=movement.no
GROUP BY account.cname

account_movement_join = JOIN account BY no LEFT, movement BY no;
account_movement = FOREACH account_movement_join GENERATE cname, amount;
customer_details = GROUP account_movement BY account::cname;
customer_balance = FOREACH customer_details GENERATE group AS cname, SUM(account_movement.movement::amount) AS balance;
Worksheet: Translating SQL to Pig (4)
SELECT branch.sortcode, branch.bname,
       COUNT(CASE WHEN type='current' THEN no ELSE NULL END) AS current,
       COUNT(CASE WHEN type='deposit' THEN no ELSE NULL END) AS deposit
FROM account JOIN branch ON account.sortcode=branch.sortcode
GROUP BY branch.sortcode, branch.bname
ORDER BY branch.sortcode, branch.bname

branch_account = JOIN branch BY sortcode, account BY sortcode;
branch_detail = GROUP branch_account BY (branch::sortcode, branch::bname);
branch_account_types = FOREACH branch_detail {
    current = FILTER branch_account BY type == 'current';
    deposit = FILTER branch_account BY type == 'deposit';
    GENERATE group.sortcode AS sortcode,
             group.bname AS bname,
             COUNT(current.no) AS current,
             COUNT(deposit.no) AS deposit;
}
branch_account_types_ordered = ORDER branch_account_types BY sortcode, bname;
SQL WHERE and HAVING
SELECT account.cname, SUM(movement.amount) AS balance
FROM account
     JOIN movement ON account.no=movement.no
WHERE ABS(movement.amount)>100
GROUP BY account.cname
HAVING SUM(movement.amount)>200

account_movement_join = JOIN account BY no, movement BY no;
account_movement_large = FILTER account_movement_join BY ABS(amount)>100;
account_movement = FOREACH account_movement_large GENERATE cname, amount;
customer_details = GROUP account_movement BY account::cname;
customer_balance_all = FOREACH customer_details GENERATE group AS cname, SUM(account_movement.movement::amount) AS balance;
customer_balance_large = FILTER customer_balance_all BY balance>200;
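The difference between the two filters can be traced in a few lines of Python. This is only a sketch of the dataflow, not of how Pig executes it: WHERE becomes a row-level filter before grouping, HAVING becomes a group-level filter after aggregation. The customer names and amounts are invented for illustration, and the function and parameter names are hypothetical.

```python
from collections import defaultdict

# Hypothetical joined rows (cname, amount); values invented for illustration.
account_movement_join = [
    ("Alice", 150.0), ("Alice", 120.0), ("Alice", 50.0),
    ("Bob", 500.0), ("Bob", -400.0),
    ("Carol", 90.0), ("Carol", 80.0),
]

def customer_balance_large(rows, row_limit=100, group_limit=200):
    # WHERE: row-level filter applied before grouping
    # (FILTER ... BY ABS(amount)>100 in the Pig script).
    large = [(cname, amount) for cname, amount in rows if abs(amount) > row_limit]
    # GROUP BY + SUM: one balance per customer name.
    balance = defaultdict(float)
    for cname, amount in large:
        balance[cname] += amount
    # HAVING: group-level filter applied after aggregation
    # (FILTER ... BY balance>200 in the Pig script).
    return {cname: total for cname, total in balance.items() if total > group_limit}

result = customer_balance_large(account_movement_join)
print(result)
```

Bob's two movements both pass the row-level filter but sum to 100, so the group-level filter drops him; Carol has no movement that passes the row-level filter at all.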
Pig Execution
Pig to Hadoop Translation
Pig scripts are interpreted into a sequence of Hadoop Map, Combine, Shuffle, and Reduce operations.
In general, a Pig script may require multiple MapReduce processes to be run.
Map and Combine processes run on nodes containing data.
The number of Reduce nodes used is specified in the Pig script (and defaults to 1!).
Temporary files are used to allow the output of one MapReduce process to be fed back as input to another MapReduce process.
Projections (from GENERATE in Pig) are automatically pushed inside joins, but otherwise little optimisation is performed by the Pig interpreter.
Quiz 5: Pig Operations in MapReduce
Which Pig operator may be executed entirely in a Map process?
A  JOIN
B  DISTINCT
C  GENERATE
D  UNION
Pig Operators in MapReduce
Translation of Pig Operators to MapReduce
Pig Operator                         Map or Reduce
FILTER R BY A == val                 Map
FOREACH R GENERATE A,B,...           Map
CROSS R,S                            Reduce
GROUP R BY A                         Combine, Reduce
JOIN R BY A, S BY B                  Reduce
JOIN R BY A LEFT OUTER, S BY B;      Reduce
JOIN R BY A RIGHT OUTER, S BY B;     Reduce
UNION R,S                            Reduce
Parallelism in Reduce Operators
Control the number of reduce nodes with a PARALLEL option at the end of a reduce operator.
The default is to have one reduce node.
Worksheet: Translating Pig to MapReduce
organisation_and_members = JOIN organization BY abbreviation, is_member BY organization;
organisation_and_countries = JOIN organisation_and_members BY is_member::country, country BY code;
organisation_data = FOREACH organisation_and_countries GENERATE abbreviation, area, population;
organisation_grouped = GROUP organisation_data BY (abbreviation);
organisation_aggregates = FOREACH organisation_grouped
    GENERATE group AS abbreviation,
             COUNT(organisation_data.abbreviation) AS no,
             SUM(organisation_data.area) AS area,
             SUM(organisation_data.population) AS population;
organisation_big_organisations = FILTER organisation_aggregates BY no>=50;
organisation_result = FOREACH organisation_big_organisations GENERATE abbreviation, area, population;
Worksheet: Translating Pig to MapReduce
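$\text{map1.1} = \pi_{\text{country},\,\text{organization}}\ \text{is\_member}$
$\text{map1.2} = \pi_{\text{abbreviation}}\ \text{organization}$
$\text{reduce1} = \pi_{\text{abbreviation},\,\text{country}}(\text{map1.1} \bowtie_{\text{abbreviation} = \text{organization}} \text{map1.2})$
$\text{map2} = \pi_{\text{code},\,\text{area},\,\text{population}}\ \text{country}$
$\text{reduce2} = \pi_{\text{abbreviation},\,\text{area},\,\text{population}}(\text{reduce1} \bowtie_{\text{country} = \text{code}} \text{map2})$
$\text{combine3} = \Gamma_{\text{abbreviation},\,\text{count(abbreviation)},\,\text{sum(area)},\,\text{sum(population)}}\ \text{reduce2}$
$\text{reduce3} = \pi_{\text{abbreviation},\,\text{area\_sum as area},\,\text{population\_sum as population}}\ \sigma_{\text{abbreviation\_count} \geq 50}\ \Gamma_{\text{abbreviation},\,\text{sum(abbreviation\_count)},\,\text{sum(area\_sum)},\,\text{sum(population\_sum)}}(\text{combine3})$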
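The split between combine3 and reduce3 can be illustrated with a small Python sketch: each node computes partial counts and sums, and the reduce step adds the partials before applying the count filter. The rows, the two-node layout, and the function names are hypothetical, and the threshold is lowered from the script's 50 to 3 so that the toy data produces a result.

```python
from collections import defaultdict

def combine(rows):
    """Partial aggregate on one node: abbreviation -> [count, area_sum, population_sum]."""
    partial = defaultdict(lambda: [0, 0, 0])
    for abbreviation, area, population in rows:
        acc = partial[abbreviation]
        acc[0] += 1
        acc[1] += area
        acc[2] += population
    return dict(partial)

def reduce_partials(partials, min_count):
    """Sum the partial aggregates, then keep groups with at least min_count rows."""
    total = defaultdict(lambda: [0, 0, 0])
    for partial in partials:
        for abbreviation, (count, area, population) in partial.items():
            acc = total[abbreviation]
            acc[0] += count          # sum of partial counts, not a fresh count
            acc[1] += area
            acc[2] += population
    return {
        abbreviation: (area, population)
        for abbreviation, (count, area, population) in total.items()
        if count >= min_count
    }

# Hypothetical (abbreviation, area, population) rows held on two nodes.
node_a = [("EU", 10, 100), ("UN", 5, 50), ("EU", 20, 200)]
node_b = [("EU", 30, 300), ("UN", 7, 70)]

result = reduce_partials([combine(node_a), combine(node_b)], min_count=3)
print(result)
```

The point of the design is that the reduce step receives one small record per group per node rather than every row, which is why the final count is a sum of partial counts.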
Pig Joins
Types of Join: Distributed Hash Join
[Figure: map nodes M1 to M3 hold partitions r1 to r3 of r, and M4 to M5 hold partitions s1 to s2 of s. A shuffle sends each row to one of the reduce nodes R1 to R3, where Ri receives the rows with h(r.a) ∈ Ki and h(s.b) ∈ Ki.]
Default implementation of Join
tu = JOIN r BY a, s BY b;
A standard JOIN uses a shuffle to distribute the tables of the join over the reduce nodes.
The shuffle uses the Java hashCode method.
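A minimal Python sketch of the idea follows, assuming rows are tuples and the join keys are given by column position. The `partition` and `hash_join` functions are hypothetical names, and `zlib.crc32` is only a deterministic stand-in for Java's hashCode. The sample rows are the (no, sortcode) columns of account and the (sortcode, bname) columns of branch from the example database.

```python
import zlib
from collections import defaultdict

def partition(key, n_reducers):
    # Stand-in for Java's hashCode: a deterministic hash of the key's text.
    return zlib.crc32(str(key).encode("utf-8")) % n_reducers

def hash_join(r, s, a, b, n_reducers=3):
    # Shuffle: each row is sent to the reducer chosen by hashing its join key,
    # so matching rows of r and s always meet on the same reducer.
    reducers = [{"r": defaultdict(list), "s": defaultdict(list)}
                for _ in range(n_reducers)]
    for row in r:
        reducers[partition(row[a], n_reducers)]["r"][row[a]].append(row)
    for row in s:
        reducers[partition(row[b], n_reducers)]["s"][row[b]].append(row)
    # Reduce: each reducer joins only the keys it received.
    out = []
    for reducer in reducers:
        for key, r_rows in reducer["r"].items():
            for r_row in r_rows:
                for s_row in reducer["s"].get(key, []):
                    out.append(r_row + s_row)
    return out

account = [(100, 67), (103, 34), (107, 56), (125, 56)]          # (no, sortcode)
branch = [(56, "Wimbledon"), (34, "Goodge St"), (67, "Strand")]  # (sortcode, bname)

joined = sorted(hash_join(account, branch, a=1, b=0))
print(joined)
```

Because every row of both tables crosses the network in the shuffle, this is the most general but also the most expensive of the join strategies.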
Types of Join: Replicated Join
[Figure: map nodes M1 to M4 hold partitions r1 to r4 of r, and M5 holds s1, the whole of s. s1 is replicated to each of M1 to M4.]
Replicated Joins
tu = JOIN r BY a, s BY b USING 'replicated';
JOIN with the replicated option causes the entire right-hand table to be copied onto all map nodes holding the left-hand table.
Replicated joins are executed as a Map process.
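As a rough Python sketch of the same idea: each block of the left-hand table stays where it is, and every map node builds an in-memory lookup from its own copy of the right-hand table. The function name and the two-block layout of account are hypothetical; the rows are the (no, sortcode) and (sortcode, bname) columns of the example database.

```python
def replicated_join(r_blocks, s, a, b):
    out = []
    for block in r_blocks:  # each block of r lives on one map node
        # The whole of s is copied to this map node and indexed in memory.
        lookup = {}
        for s_row in s:
            lookup.setdefault(s_row[b], []).append(s_row)
        # The join is completed locally: no shuffle or reduce phase is needed.
        for r_row in block:
            for s_row in lookup.get(r_row[a], []):
                out.append(r_row + s_row)
    return out

# account (no, sortcode) split over two map nodes; branch (sortcode, bname) is small.
account_blocks = [[(100, 67), (103, 34)], [(107, 56), (125, 56)]]
branch = [(56, "Wimbledon"), (34, "Goodge St"), (67, "Strand")]

joined = replicated_join(account_blocks, branch, a=1, b=0)
print(joined)
```

The sketch only works when the replicated table fits in the memory of a single node, which is the condition the slide places on the right-hand table.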
Quiz 6: Pig Replicated Joins
branch
sortcode bname cash
56 'Wimbledon' 94340.45
34 'Goodge St' 8900.67
67 'Strand' 34005.00

account
no type cname rate? sortcode
100 'current' 'McBrien, P.' NULL 67
101 'deposit' 'McBrien, P.' 5.25 67
103 'current' 'Boyd, M.' NULL 34
107 'current' 'Poulovassilis, A.' NULL 56
119 'deposit' 'Poulovassilis, A.' 5.50 56
125 'current' 'Bailey, J.' NULL 56
The branch table is small enough to fit easily on one node, whilst account is not.
Which Pig Script is invalid?
A  ba = JOIN account BY sortcode, branch BY sortcode;
B  ba = JOIN account BY sortcode RIGHT, branch BY sortcode USING 'replicated';
C  ba = JOIN account BY sortcode LEFT, branch BY sortcode USING 'replicated';
D  ba = JOIN account BY sortcode, branch BY sortcode USING 'replicated';
Types of Join: Skewed Join
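[Figure: map nodes M1 to M3 hold partitions s1 to s3 of s, and M4 to M5 hold partitions r1 to r2 of r. A shuffle sends rows to reduce nodes R1 to R4: R1 receives r.a = K1 and s.b = K1; R2 and R3 each receive some of the rows with r.a = K2 and all rows with s.b = K2; R4 receives r.a = K3 and s.b = K3.]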
Join optimised for skewed distribution of keys
tu = JOIN r BY a, s BY b USING 'skewed';
A skewed join first generates a histogram of the frequency of each join key in r.
The histogram is used to distribute the tables of the join over the reduce nodes. For keys with high frequency in r:
rows of r are distributed in round-robin fashion
rows of s are duplicated
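The distribution rule can be sketched in Python as below. This is an illustration under stated assumptions, not Pig's implementation: the `hot_threshold` and `spread` parameters, the key-to-reducer assignment, and the sample rows are all hypothetical. The function also returns how many rows of r each reducer received, so the effect on load can be seen.

```python
from collections import Counter
from itertools import cycle

def skewed_join(r, s, a, b, n_reducers=4, hot_threshold=3, spread=2):
    # Step 1: histogram of join-key frequencies in r.
    histogram = Counter(row[a] for row in r)
    # Step 2: give each key one reducer, or several if the key is frequent in r.
    targets, next_reducer = {}, 0
    for key in sorted(set(histogram) | {row[b] for row in s}):
        width = min(spread, n_reducers) if histogram[key] >= hot_threshold else 1
        targets[key] = [(next_reducer + i) % n_reducers for i in range(width)]
        next_reducer += width
    # Step 3: shuffle. Rows of r go round robin over their key's reducers;
    # rows of s are duplicated to every reducer that handles their key.
    reducers = [{"r": [], "s": []} for _ in range(n_reducers)]
    round_robin = {key: cycle(nodes) for key, nodes in targets.items()}
    for row in r:
        reducers[next(round_robin[row[a]])]["r"].append(row)
    for row in s:
        for node in targets[row[b]]:
            reducers[node]["s"].append(row)
    # Step 4: each reducer joins the rows it holds.
    out = [r_row + s_row
           for reducer in reducers
           for r_row in reducer["r"]
           for s_row in reducer["s"]
           if r_row[a] == s_row[b]]
    return out, [len(reducer["r"]) for reducer in reducers]

# Hypothetical data: key "K2" is far more frequent in r than "K1" or "K3".
r = [("K2", i) for i in range(6)] + [("K1", 0), ("K3", 0)]
s = [("K1", "x"), ("K2", "y"), ("K3", "z")]

joined, r_load = skewed_join(r, s, a=0, b=0)
print(r_load)
```

Each row of r lands on exactly one reducer and the matching rows of s are present on every reducer that key can reach, so each matching pair is produced exactly once.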
Types of Join: Merge Join
[Figure: map nodes M1 to M3 hold partitions r3, r2 and r1 of r, and M4 to M6 hold partitions s1 to s3 of s. The map nodes holding r load blocks of s.]
Merge Joins
tu = JOIN r BY a, s BY b USING 'merge';
A version of sort-merge join where it is assumed that both inputs are already sorted.
The first record of each block of s is sampled to determine the layout.
Map nodes of r load blocks of s as required.
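The merge step itself can be sketched in Python as a single pass over two sorted inputs. The sketch assumes both lists are already sorted on their join keys and leaves out the block sampling and block loading described above; the function name and the sample rows are hypothetical.

```python
def merge_join(r, s, a, b):
    """Join two inputs that are already sorted on their join keys, in one pass."""
    out = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][a] < s[j][b]:
            i += 1                      # r is behind: advance r
        elif r[i][a] > s[j][b]:
            j += 1                      # s is behind: advance s
        else:
            key = r[i][a]
            # Find the run of equal keys on each side...
            i_end = i
            while i_end < len(r) and r[i_end][a] == key:
                i_end += 1
            j_end = j
            while j_end < len(s) and s[j_end][b] == key:
                j_end += 1
            # ...and emit the cross product of the two runs.
            for r_row in r[i:i_end]:
                for s_row in s[j:j_end]:
                    out.append(r_row + s_row)
            i, j = i_end, j_end
    return out

# Hypothetical inputs, both sorted on the first column.
r = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
s = [(2, "S"), (3, "S"), (4, "S"), (4, "T")]

joined = merge_join(r, s, a=0, b=0)
print(joined)
```

Neither input is re-sorted or hashed, which is why the join only gives correct results when the sortedness assumption holds.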
Quiz 7: Pig Join Type Selection
web_log(timestamp, url, ip_address, size)
firewall_log(timestamp, ip_address, status)
Suppose the two logs have data created in timestamp order, and the following Pig script is to be executed:
suspect_log = FILTER firewall_log BY status == 'S';
suspect_fetch = JOIN web_log BY timestamp, suspect_log BY timestamp;
Which Pig JOIN option is best suited to the above dataset?
A  default (Hash Join)
B  replicated
C  merge
D  skewed