Pig Latin: A Not-So-Foreign Language For Data Processing

Chris Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
Research

Shimin Chen, Big Data Reading Group Presentation

Post on 19-Dec-2015



Page 2: Data Processing Renaissance

Internet companies swimming in data
• E.g. TBs/day at Yahoo!

Data analysis is “inner loop” of product innovation

Data analysts are skilled programmers

Page 3: Data Warehousing …?

Scale: Often not scalable enough

$ $ $ $: Prohibitively expensive at web scale
• Up to $200K/TB

SQL:
• Little control over execution method
• Query optimization is hard
– Parallel environment
– Little or no statistics
– Lots of UDFs

Page 4: New Systems For Data Analysis

Map-Reduce

Apache Hadoop

Dryad

. . .

Page 5: Map-Reduce

[Figure: input records (k1 v1), (k2 v2), (k1 v3), (k2 v4), (k1 v5) pass through two map tasks, are grouped by key into (k1: v1, v3, v5) and (k2: v2, v4), and pass through two reduce tasks that produce the output records.]

Just a group-by-aggregate?
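The "group-by-aggregate" reading of the figure can be sketched in a few lines of Python. This is a single-process illustration only, not how Hadoop executes a job; the function name `map_reduce` and the helper functions are assumptions made for this example, and the records are the ones shown in the figure.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run one in-memory map-reduce pass (illustrative, single process)."""
    groups = defaultdict(list)
    # Map phase: each input record may emit zero or more (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            # "Shuffle": collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: one call per distinct key, i.e. an aggregate per group.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def identity_map(record):
    # The record is already a (key, value) pair; emit it unchanged.
    return [record]


def collect(key, values):
    return values


def count(key, values):
    return len(values)


records = [("k1", "v1"), ("k2", "v2"), ("k1", "v3"), ("k2", "v4"), ("k1", "v5")]
grouped = map_reduce(records, identity_map, collect)
counts = map_reduce(records, identity_map, count)
```

Here `grouped` reproduces the per-key grouping in the figure, and `counts` shows the same pass used as a group-by with a count aggregate.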

Page 6: The Map-Reduce Appeal

Scale: Scalable due to simpler design
• Only parallelizable operations
• No transactions

$: Runs on cheap commodity hardware

Instead of SQL: procedural control, a processing “pipe”

Page 7: Disadvantages

1. Extremely rigid data flow (M → R)
• Other flows constantly hacked in: Join, Union; Split; Chains

2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct

3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize

Page 8: Pros And Cons

Need a high-level, general data flow language

Page 9: Enter Pig Latin

Pig Latin

Need a high-level, general data flow language

Page 10: Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin example

• Salient features

• Implementation

Page 11: Example Data Analysis Task

Find the top 10 most visited pages in each category

Visits
User   Url          Time
Amy    cnn.com      8:00
Amy    bbc.com      10:00
Amy    flickr.com   10:05
Fred   cnn.com      12:00

Url Info
Url          Category   PageRank
cnn.com      News       0.9
bbc.com      News       0.8
flickr.com   Photos     0.7
espn.com     Sports     0.9

Page 12: Data Flow

[Figure: data flow diagram]
Load Visits → Group by url → Foreach url generate count
Load Url Info
Both branches → Join on url → Group by category → Foreach category generate top10 urls

Page 13: In Pig Latin

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';
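For readers who think in a general-purpose language, the same data flow can be sketched in plain Python over the sample rows from the example task. This is an in-memory illustration and not Pig's execution model; the function name `top_urls_per_category` is hypothetical, and the sketch assumes each url appears at most once in the Url Info table.

```python
from collections import Counter, defaultdict

visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
url_info = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
            ("flickr.com", "Photos", 0.7), ("espn.com", "Sports", 0.9)]


def top_urls_per_category(visits, url_info, n=10):
    # group visits by url; foreach group generate url, count(visits)
    visit_counts = Counter(url for _user, url, _time in visits)
    # join visitCounts by url, urlInfo by url (inner join, urls unique in url_info)
    joined = [(url, visit_counts[url], category)
              for url, category, _p_rank in url_info if url in visit_counts]
    # group visitCounts by category
    by_category = defaultdict(list)
    for url, count, category in joined:
        by_category[category].append((url, count))
    # foreach category generate top(visitCounts, n): highest counts first,
    # ties broken by url so the result is deterministic
    return {category: sorted(rows, key=lambda row: (-row[1], row[0]))[:n]
            for category, rows in by_category.items()}


top_urls = top_urls_per_category(visits, url_info)
top_one = top_urls_per_category(visits, url_info, n=1)
```

Because espn.com has no visits, the inner join drops it and no Sports category appears in `top_urls`.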

Page 14: Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin example

• Salient features

• Implementation

Page 15: Step-by-step Procedural Control

Target users are entrenched procedural programmers

“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.”
Jasmine Novak, Engineer, Yahoo!

• Automatic query optimization is hard
• Pig Latin does not preclude optimization

“With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.”
David Ciemiewicz, Search Excellence, Yahoo!

Page 16: Quick Start and Interoperability

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';

Operates directly over files

Page 17: Quick Start and Interoperability

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';

Schemas optional; can be assigned dynamically

Page 18: User-Code as a First-Class Citizen

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';

User-defined functions (UDFs) can be used in every construct
• Load, Store
• Group, Filter, Foreach
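One way to picture "UDFs in every construct" is a set of operators that each take a user function as a parameter. The Python sketch below is only an analogy, not Pig's UDF interface; every operator and helper name in it (`load`, `filter_bag`, `group`, `foreach`, `parse_tsv`, `is_morning`, `domain_suffix`) is hypothetical and chosen for this example.

```python
from collections import defaultdict


def load(lines, parse):
    """Load: a user-supplied parser turns each raw line into a tuple."""
    return [parse(line) for line in lines]


def filter_bag(bag, predicate):
    """Filter: keep the tuples for which the user predicate is true."""
    return [t for t in bag if predicate(t)]


def group(bag, key_fn):
    """Group: the grouping key is computed by a user function."""
    groups = defaultdict(list)
    for t in bag:
        groups[key_fn(t)].append(t)
    return dict(groups)


def foreach(bag, generate):
    """Foreach: apply a user function to every tuple."""
    return [generate(t) for t in bag]


# Hypothetical user-defined functions.
def parse_tsv(line):
    return tuple(line.split("\t"))


def is_morning(visit):
    return int(visit[2].split(":")[0]) < 12


def domain_suffix(visit):
    return visit[1].rsplit(".", 1)[-1]


raw = ["Amy\tcnn.com\t8:00", "Amy\tbbc.com\t10:00",
       "Amy\tflickr.com\t10:05", "Fred\tcnn.com\t12:00"]
visits = load(raw, parse_tsv)
morning = filter_bag(visits, is_morning)
by_suffix = group(morning, domain_suffix)
users = foreach(morning, lambda visit: visit[0].upper())
```

The point of the design is that user code plugs in at every stage, rather than being confined to a single map or reduce function.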

Page 19: Nested Data Model

• Pig Latin has a fully-nestable data model with:
– Atomic values, tuples, bags (lists), and maps

• More natural to programmers than flat tuples
• Avoids expensive joins
• See paper

[Figure: yahoo, {finance, email, news}]
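A rough Python analogy for the four kinds of values: an atom is a plain value, a tuple is a Python tuple, a bag is a list of tuples, and a map is a dict. The record below is made up for illustration (the `country` entry in particular is an assumption), and `flatten` is a hypothetical helper that shows how a nested bag relates to the flat rows a relational system would need a join to reassemble.

```python
# Atom: "yahoo"; bag: a list of one-field tuples; map: a dict of extra attributes.
record = ("yahoo", [("finance",), ("email",), ("news",)], {"country": "US"})


def flatten(t, bag_index):
    """Un-nest the bag at position bag_index into one flat tuple per element."""
    before, bag, after = t[:bag_index], t[bag_index], t[bag_index + 1:]
    return [before + inner + after for inner in bag]


# Drop the map field, then un-nest the bag of properties.
flat = flatten(record[:2], 1)
```

Keeping the bag nested inside the tuple stores the three properties with their owner, so no join is needed to bring them back together.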

Page 20: Pig Latin Operators

• Input/Output:
– Load
– Store

• Operations on a single bag
– Foreach
– Filter
– Order
– Group
– Distinct

• Operations on multiple bags
– Co-group, Join
– Union

From paper
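The relationship between the two multi-bag operators can be sketched in Python: co-group collects, per key, the matching tuples from each input into separate bags, and an equi-join is then the cross product of those bags within each key. The function names and sample rows are assumptions for this example; it runs in memory and says nothing about how Pig executes either operator.

```python
from collections import defaultdict


def cogroup(left, right, left_key, right_key):
    """For each key, return (bag of left tuples, bag of right tuples)."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[left_key(t)][0].append(t)
    for t in right:
        groups[right_key(t)][1].append(t)
    return dict(groups)


def join(left, right, left_key, right_key):
    """Equi-join expressed as co-group followed by a per-key cross product."""
    grouped = cogroup(left, right, left_key, right_key)
    return [l + r for lefts, rights in grouped.values()
            for l in lefts for r in rights]


visit_counts = [("cnn.com", 2), ("bbc.com", 1)]
url_info = [("cnn.com", "News", 0.9), ("espn.com", "Sports", 0.9)]


def first_field(t):
    return t[0]


grouped = cogroup(visit_counts, url_info, first_field, first_field)
joined = join(visit_counts, url_info, first_field, first_field)
```

Note that co-group keeps keys that appear on only one side (with an empty bag for the other), while the join drops them.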

Page 21: Outline

• Map-Reduce and the need for Pig Latin

• Pig Latin example

• Novel features

• Implementation

Page 22: Implementation

[Figure: layered stack. The user writes SQL or Pig or Hadoop Map-Reduce; "automatic rewrite + optimize" steps connect the layers; Hadoop Map-Reduce runs on the cluster.]

Pig is open-source.
http://incubator.apache.org/pig

Page 23: Compilation into Map-Reduce

[Figure: the data flow (Load Visits → Group by url → Foreach url generate count; Load Url Info; Join on url → Group by category → Foreach category generate top10(urls)) divided into three map-reduce jobs: Map1/Reduce1, Map2/Reduce2, Map3/Reduce3.]

Every group or join operation forms a map-reduce boundary

Other operations pipelined into map and reduce phases
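The boundary rule can be sketched as a small Python function that walks a simplified, linear list of steps and starts a new job at every group or join. This is an illustration of the rule as stated on the slide, not Pig's compiler: `split_into_jobs` is a hypothetical name, the plan is flattened into a list, and the sketch ignores that a later job's map phase also reads the previous job's output.

```python
BOUNDARY_OPS = {"group", "join"}


def split_into_jobs(steps):
    """Cut a linear list of (operator, description) steps into map-reduce jobs."""
    jobs = []
    pending_map = []      # operators waiting for the next boundary
    in_reduce = False     # True once the current job has passed its boundary
    for op, description in steps:
        if op in BOUNDARY_OPS:
            # Every group or join closes a map phase and opens a reduce phase.
            jobs.append({"map": pending_map, "boundary": description, "reduce": []})
            pending_map = []
            in_reduce = True
        elif op == "load":
            # A load always belongs to the map phase of the next job.
            in_reduce = False
            pending_map.append(description)
        elif in_reduce:
            # Other operators are pipelined into the current reduce phase.
            jobs[-1]["reduce"].append(description)
        else:
            pending_map.append(description)
    return jobs


plan = [("load", "Load Visits"),
        ("group", "Group by url"),
        ("foreach", "Foreach url generate count"),
        ("load", "Load Url Info"),
        ("join", "Join on url"),
        ("group", "Group by category"),
        ("foreach", "Foreach category generate top10(urls)")]
jobs = split_into_jobs(plan)
```

For the example plan this yields three jobs, one per group or join, which matches the Map1/Reduce1 through Map3/Reduce3 labels in the figure.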

Page 24: Pig Pen: Debugging Environment

From paper

Page 25: Usage

• First production release about a year ago

• 150+ early adopters within Yahoo!

• Over 25% of the Yahoo! map-reduce user base

Page 26: Related Work

• Sawzall
– Data processing language on top of map-reduce
– Rigid structure of filtering followed by aggregation

• DryadLINQ
– SQL-like language on top of Dryad

• Nested data models
– Object-oriented databases

Page 27: Distributed Sorting in DryadLINQ

public static IQueryable<TSource> DSort<TSource, TKey>(
    this IQueryable<TSource> source,
    Expression<Func<TSource, TKey>> keySelector,
    int pcount)
{
    var samples = source.Apply(x => Sampling(x));
    var keys = samples.Apply(x => ComputeKeys(x, pcount));
    var parts = source.RangePartition(keySelector, keys);
    return parts.OrderBy(keySelector);
}

From Mihai Budiu’s slides on “Cluster Computing with DryadLINQ”
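The four steps in the DryadLINQ snippet (sample, compute split keys, range-partition, sort each partition) can be mirrored in a single-process Python sketch. This is not DryadLINQ's implementation; `dsort`, `sample_size`, and `seed` are names assumed for this example, and the sampling and key-selection strategy is one simple choice among many.

```python
import bisect
import random


def dsort(source, key, pcount, sample_size=100, seed=0):
    """Sort by sampling split keys, range-partitioning, and sorting each part."""
    items = list(source)
    if not items:
        return []
    rng = random.Random(seed)
    # Sampling: look at a subset of the keys to estimate their distribution.
    sample = sorted(key(x) for x in rng.sample(items, min(sample_size, len(items))))
    # ComputeKeys: pick pcount - 1 boundaries, evenly spaced through the sample.
    boundaries = [sample[(i * len(sample)) // pcount] for i in range(1, pcount)]
    # RangePartition: partition i holds keys between boundaries i - 1 and i.
    parts = [[] for _ in range(pcount)]
    for x in items:
        parts[bisect.bisect_right(boundaries, key(x))].append(x)
    # OrderBy: sort each partition; concatenation is then globally sorted.
    return [x for part in parts for x in sorted(part, key=key)]


data = [42, 7, 19, 3, 88, 61, 7, 25, 0, 54]
result = dsort(data, key=lambda x: x, pcount=4)
```

In a cluster each partition would be sorted on a different machine; the sketch only shows why sorted partitions over disjoint key ranges concatenate into a fully sorted result.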

Page 28: Sawzall Example

proto "querylog.proto"
queries_per_degree: table sum[lat: int][lon: int] of int;
log_record: QueryLogProto = input;
loc: Location = locationinfo(log_record.ip);
emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1
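The Sawzall program follows the "filter, then aggregate" shape noted under Related Work: each log record emits a 1 into a sum table indexed by integer latitude and longitude. A rough Python equivalent is sketched below; the record layout, the `locate` callback, and the sample addresses and coordinates are all assumptions standing in for `QueryLogProto` and `locationinfo`.

```python
from collections import Counter


def queries_per_degree(log_records, locate):
    """Count queries per (integer latitude, integer longitude) cell."""
    table = Counter()
    for record in log_records:
        lat, lon = locate(record["ip"])
        # Equivalent of: emit queries_per_degree[int(lat)][int(lon)] <- 1
        table[(int(lat), int(lon))] += 1
    return table


# Hypothetical lookup table standing in for a geolocation service.
fake_locations = {"10.0.0.1": (37.4, -122.1), "10.0.0.2": (37.9, -122.6),
                  "10.0.0.3": (40.7, -74.0)}
logs = [{"ip": "10.0.0.1"}, {"ip": "10.0.0.2"}, {"ip": "10.0.0.3"}]
table = queries_per_degree(logs, fake_locations.__getitem__)
```

The first two addresses fall in the same one-degree cell, so their counts are summed, which is the aggregation the `table sum` declaration performs.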

Page 29: Future Work

• Optional “safe” query optimizer
– Performs only high-confidence rewrites

• User interface
– Boxes and arrows UI
– Promote collaboration, sharing code fragments and UDFs

• Tight integration with a scripting language
– Use loops, conditionals of host language

Page 30: Credits

Arun Murthy, Pi Song, Santhosh Srinivasan, Amir Youssefi

Shubham Chopra, Alan Gates, Shravan Narayanamurthy, Olga Natkovich

Page 31: Summary

• Big demand for parallel data processing
– Emerging tools that do not look like SQL DBMS
– Programmers like dataflow pipes over static files

• Hence the excitement about Map-Reduce

• But, Map-Reduce is too low-level and rigid

Pig Latin: Sweet spot between map-reduce and SQL