utkarsh srivastava pig : building high-level dataflows over map-reduce research & cloud...

46
Utkarsh Srivastava Pig : Building High- Level Dataflows over Map-Reduce Research & Cloud Computing

Post on 21-Dec-2015

214 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Utkarsh Srivastava

Pig : Building High-Level Dataflows over Map-Reduce

Pig : Building High-Level Dataflows over Map-Reduce

Research &Cloud Computing

Page 2: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Data Processing Renaissance

Internet companies swimming in data• E.g. TBs/day at Yahoo!

Data analysis is “inner loop” of product innovation

Data analysts are skilled programmers

Page 3: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Data Warehousing …?

ScaleScale Often not scalable enough

$ $ $ $$ $ $ $Prohibitively expensive at web scale• Up to $200K/TB

SQLSQL• Little control over execution method• Query optimization is hard• Parallel environment• Little or no statistics• Lots of UDFs

Page 4: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

New Systems For Data Analysis

Map-Reduce

Apache Hadoop

Dryad

. . .

Page 5: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Map-Reduce

Inputrecords

k1 v1

k2 v2

k1 v3

k2 v4

k1 v5

mapmap

mapmap

k1 v1

k1 v3

k1 v5

k2 v2

k2 v4

Outputrecords

reducereduce

reducereduce

Just a group-by-aggregate?Just a group-by-aggregate?

Page 6: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

The Map-Reduce Appeal

ScaleScaleScalable due to simpler design• Only parallelizable operations• No transactions

$ $ Runs on cheap commodity hardware

Procedural Control- a processing “pipe”SQL SQL

Page 7: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Disadvantages

1. Extremely rigid data flow

Other flows constantly hacked in

Join, Union Split

MM RR

MM MM RR MM

Chains

2. Common operations must be coded by hand• Join, filter, projection, aggregates, sorting, distinct

3. Semantics hidden inside map-reduce functions• Difficult to maintain, extend, and optimize

Page 8: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Pros And Cons

Need a high-level, general data flow language

Page 9: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Enter Pig Latin

Pig LatinPig Latin

Need a high-level, general data flow language

Page 10: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin

• Compilation into Map-Reduce

• Example Generation

• Future Work

Page 11: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Example Data Analysis Task

User Url Time

Amy cnn.com 8:00

Amy bbc.com 10:00

Amy flickr.com 10:05

Fred cnn.com 12:00

Find the top 10 most visited pages in each category

Url Category PageRank

cnn.com News 0.9

bbc.com News 0.8

flickr.com Photos 0.7

espn.com Sports 0.9

Visits Url Info

Page 12: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Data Flow

Load VisitsLoad Visits

Group by urlGroup by url

Foreach urlgenerate count

Foreach urlgenerate count Load Url InfoLoad Url Info

Join on urlJoin on url

Group by categoryGroup by category

Foreach categorygenerate top10 urls

Foreach categorygenerate top10 urls

Page 13: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

In Pig Latinvisits = load ‘/data/visits’ as (user, url, time);gVisits = group visits by url;visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load ‘/data/urlInfo’ as (url, category, pRank);visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into ‘/data/topUrls’;

Page 14: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Step-by-step Procedural ControlTarget users are entrenched procedural programmers

The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.

The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.

Jasmine NovakEngineer, Yahoo!

• Automatic query optimization is hard • Pig Latin does not preclude optimization

With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.

With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.

David CiemiewiczSearch Excellence, Yahoo!

Page 15: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

visits = load ‘/data/visits’ as (user, url, time);gVisits = group visits by url;visitCounts = foreach gVisits generate url, count(urlVisits);

urlInfo = load ‘/data/urlInfo’ as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;gCategories = group visitCounts by category;topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into ‘/data/topUrls’;

Quick Start and Interoperability

Operates directly over filesOperates directly over files

Page 16: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

visits = load ‘/data/visits’ as (user, url, time);gVisits = group visits by url;visitCounts = foreach gVisits generate url, count(urlVisits);

urlInfo = load ‘/data/urlInfo’ as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;gCategories = group visitCounts by category;topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into ‘/data/topUrls’;

Quick Start and Interoperability

Schemas optional; Can be assigned dynamically

Schemas optional; Can be assigned dynamically

Page 17: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

visits = load ‘/data/visits’ as (user, url, time);gVisits = group visits by url;visitCounts = foreach gVisits generate url, count(urlVisits);

urlInfo = load ‘/data/urlInfo’ as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;gCategories = group visitCounts by category;topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into ‘/data/topUrls’;

User-Code as a First-Class Citizen

User-defined functions (UDFs) can be used in every construct• Load, Store• Group, Filter, Foreach

User-defined functions (UDFs) can be used in every construct• Load, Store• Group, Filter, Foreach

Page 18: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

• Pig Latin has a fully-nestable data model with:– Atomic values, tuples, bags (lists), and maps

• More natural to programmers than flat tuples• Avoids expensive joins

Nested Data Model

yahoo ,financeemailnews

Page 19: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

• Common case: aggregation on these nested sets• Power users: sophisticated UDFs, e.g., sequence analysis• Efficient Implementation (see paper)

Nested Data Model

Decouples grouping as an independent operationUser Url Time

Amy cnn.com 8:00

Amy bbc.com 10:00

Amy bbc.com 10:05

Fred cnn.com 12:00

group Visits

cnn.comAmy cnn.com 8:00

Fred cnn.com 12:00

bbc.comAmy bbc.com 10:00

Amy bbc.com 10:05

group by url

I frankly like pig much better than SQL in some respects (group + optional flatten works better for me, I love nested data structures).”

I frankly like pig much better than SQL in some respects (group + optional flatten works better for me, I love nested data structures).”

Ted DunningChief Scientist, Veoh19

Page 20: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

CoGroup

query url rank

Lakers nba.com 1

Lakers espn.com 2

Kings nhl.com 1

Kings nba.com 2

query adSlot amount

Lakers top 50

Lakers side 20

Kings top 30

Kings side 10

group results revenue

LakersLakers nba.com 1 Lakers top 50

Lakers espn.com 2 Lakers side 20

KingsKings nhl.com 1 Kings top 30

Kings nba.com 2 Kings side 10

results revenue

Cross-product of the 2 bags would give natural join

Page 21: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin

• Compilation into Map-Reduce

• Example Generation

• Future Work

Page 22: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Implementation

cluster

Hadoop Map-Reduce

Hadoop Map-Reduce

PigPig

SQL

automaticrewrite +optimize

or

or

user

Pig is open-source.http://hadoop.apache.org/pig

Pig is open-source.http://hadoop.apache.org/pig

• ~50% of Hadoop jobs at Yahoo! are Pig• 1000s of jobs per day

Page 23: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Compilation into Map-Reduce

Load VisitsLoad Visits

Group by urlGroup by url

Foreach urlgenerate count

Foreach urlgenerate count Load Url InfoLoad Url Info

Join on urlJoin on url

Group by categoryGroup by category

Foreach categorygenerate top10(urls)

Foreach categorygenerate top10(urls)

Map1

Reduce1Map2

Reduce2

Map3

Reduce3

Every group or join operation forms a map-reduce boundary

Other operations pipelined into map and reduce phases

Page 24: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Optimizations: Using the Combiner

Inputrecords

k1 v1

k2 v2

k1 v3

k2 v4

k1 v5

mapmap

mapmap

k1 v1

k1 v3

k1 v5

k2 v2

k2 v4

Outputrecords

reducereduce

reducereduce

Can pre-process data on the map-side to reduce data shipped• Algebraic Aggregation Functions• Distinct processing

Page 25: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Optimizations: Skew Join

• Default join method is symmetric hash join.

group results revenue

LakersLakers nba.com 1 Lakers top 50

Lakers espn.com 2 Lakers side 20

KingsKings nhl.com 1 Kings top 30

Kings nba.com 2 Kings side 10

cross product carried out on 1 reducer

• Problem if too many values with same key• Skew join samples data to find frequent values• Further splits them among reducers

Page 26: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Optimizations: Fragment-Replicate Join

• Symmetric-hash join repartitions both inputs

• If size(data set 1) >> size(data set 2)– Just replicate data set 2 to all partitions of data set 1

• Translates to map-only job– Open data set 2 as “side file”

Page 27: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Optimizations: Merge Join

• Exploit data sets are already sorted.

• Again, a map-only job– Open other data set as “side file”

Page 28: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Optimizations: Multiple Data Flows

Load UsersLoad Users

Filter botsFilter bots

Group by state

Group by state

Apply udfsApply udfs

Store into ‘bystate’

Store into ‘bystate’

Group by demographic

Group by demographic

Apply udfsApply udfs

Store into ‘bydemo’

Store into ‘bydemo’

Map1

Reduce1

Page 29: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Optimizations: Multiple Data Flows

Load UsersLoad Users

Filter botsFilter bots

Group by state

Group by state

Apply udfsApply udfs

Store into ‘bystate’

Store into ‘bystate’

Group by demographic

Group by demographic

Apply udfsApply udfs

Store into ‘bydemo’

Store into ‘bydemo’

SplitSplit

DemultiplexDemultiplex

Map1

Reduce1

Page 30: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Other Optimizations

• Carry data as byte arrays as far as possible

• Using binary comparator for sorting

• “Streaming” data through external executables

Page 31: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Performance

Page 32: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin

• Compilation into Map-Reduce

• Example Generation

• Future Work

Page 33: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Example Dataflow Program

LOAD(user, url)

LOAD(url, pagerank)

FOREACHuser, canonicalize(url)

JOINon url

GROUPon user

FOREACHuser, AVG(pagerank)

FILTERavgPR> 0.5

Find users that tend to visit

high-pagerank pages

Page 34: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Iterative Process

LOAD(user, url)

LOAD(url, pagerank)

FOREACHuser, canonicalize(url)

JOINon url

GROUPon user

FOREACHuser, AVG(pagerank)

FILTERavgPR> 0.5

Bug in UDFcanonicalize?

Joining on right attribute?

Everything being filtered out?

No Output

Page 35: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

How to do test runs?

• Run with real data– Too inefficient (TBs of data)

• Create smaller data sets (e.g., by sampling)– Empty results due to joins [Chaudhuri et. al. 99], and

selective filters

• Biased sampling for joins– Indexes not always present

Page 36: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Examples to Illustrate Program

LOAD(user, url)

LOAD(url, pagerank)

FOREACHuser, canonicalize(url)

JOINon url

GROUPon user

FOREACHuser, AVG(pagerank)

FILTERavgPR> 0.5

(Amy, cnn.com) (Amy, http://www.frogs.com)(Fred, www.snails.com/index.html)

(Amy, www.cnn.com) (Amy, www.frogs.com)(Fred, www.snails.com)

(www.cnn.com, 0.9) (www.frogs.com, 0.3)(www.snails.com, 0.4)

(Amy, www.cnn.com, 0.9) (Amy, www.frogs.com, 0.3)(Fred, www.snails.com, 0.4)

(Amy, 0.6) (Fred, 0.4)

(Amy, 0.6)

(Amy, www.cnn.com, 0.9) (Amy, www.frogs.com, 0.3)

(Fred, www.snails.com, 0.4)

( Amy,

( Fred, )

)

Page 37: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Value Addition From Examples

• Examples can be used for

– Debugging

– Understanding a program written by someone else

– Learning a new operator, or language

Page 38: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Good Examples: Consistency

LOAD(user, url)

LOAD(url, pagerank)

FOREACHuser, canonicalize(url)

JOINon url

GROUPon user

FOREACHuser, AVG(pagerank)

FILTERavgPR> 0.5

0. Consistency0. Consistency

output example =

operator applied on input example

(Amy, cnn.com) (Amy, http://www.frogs.com)(Fred, www.snails.com/index.html)

(Amy, www.cnn.com) (Amy, www.frogs.com)(Fred, www.snails.com)

Page 39: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Good Examples: Realism

LOAD(user, url)

LOAD(url, pagerank)

FOREACHuser, canonicalize(url)

JOINon url

GROUPon user

FOREACHuser, AVG(pagerank)

FILTERavgPR> 0.5

1. Realism1. Realism

(Amy, cnn.com) (Amy, http://www.frogs.com)(Fred, www.snails.com/index.html)

(Amy, www.cnn.com) (Amy, www.frogs.com)(Fred, www.snails.com)

Page 40: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Good Examples: Completeness

LOAD(user, url)

LOAD(url, pagerank)

FOREACHuser, canonicalize(url)

JOINon url

GROUPon user

FOREACHuser, AVG(pagerank)

FILTERavgPR> 0.5

Demonstrate the salient properties of each operator,

e.g., FILTER

2. Completeness2. Completeness

(Amy, 0.6) (Fred, 0.4)

(Amy, 0.6)

Page 41: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Good Examples: Conciseness

LOAD(user, url)

LOAD(url, pagerank)

FOREACHuser, canonicalize(url)

JOINon url

GROUPon user

FOREACHuser, AVG(pagerank)

FILTERavgPR> 0.5

3. Conciseness3. Conciseness

(Amy, cnn.com) (Amy, http://www.frogs.com)(Fred, www.snails.com/index.html)

(Amy, www.cnn.com) (Amy, www.frogs.com)(Fred, www.snails.com)

Page 42: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Implementation Status

• Available as ILLUSTRATE command in open-source release of Pig

• Available as Eclipse Plugin (PigPen)

• See SIGMOD09 paper for algorithm and experiments

Page 43: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Related Work

• Sawzall– Data processing language on top of map-reduce– Rigid structure of filtering followed by aggregation

• Hive– SQL-like language on top of Map-Reduce

• DryadLINQ– SQL-like language on top of Dryad

• Nested data models– Object-oriented databases

Page 44: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Future / In-Progress Tasks

• Columnar-storage layer• Metadata repository• Profiling and Performance Optimizations• Tight integration with a scripting language– Use loops, conditionals, functions of host language

• Memory Management• Project Suggestions at:

http://wiki.apache.org/pig/ProposedProjects

Page 45: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Credits

Page 46: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing

Summary

• Big demand for parallel data processing– Emerging tools that do not look like SQL DBMS– Programmers like dataflow pipes over static files

• Hence the excitement about Map-Reduce

• But, Map-Reduce is too low-level and rigid

Pig LatinSweet spot between map-reduce and SQL

Pig LatinSweet spot between map-reduce and SQL