-
Higher level scripting languages for distributed data processing
Pelle Jakovits
October 4th 2017, Tartu
-
Outline
• Disadvantages of MapReduce
• Higher level scripting languages
• Apache Pig framework
– Pig Latin language
– Execution flow
– Advantages
– Disadvantages
Pelle Jakovits 2/34
-
You already know MapReduce
• MapReduce = Map, GroupBy, Sort, Reduce
• Designed for huge scale data processing
• Provides:
– Distributed file system
– High scalability
– Automatic parallelization
– Automatic fault recovery
• Data is replicated
• Failed tasks are re-executed on other nodes
-
Is MapReduce sufficient?
• One of the most used frameworks for large scale data processing
• However, very often MapReduce is not used directly anymore
Why not?
-
MapReduce Disadvantages
• Not suitable for prototyping
– Need to write low-level Java code
– A lot of custom code is required, even for the simplest tasks
• A lot of expertise is needed to optimize MapReduce code
• Hard to manage more complex MapReduce job chains
-
Dataset abstraction
• MapReduce uses a (Key, Value) representation of data
• Move from the Key and Value based data structure to sparse tables with specific schemas
• (Key, Value) ->
  (Field 1, Field 2, …, Field N) ->
  (LastName, FirstName, Balance, Address, City, Date, Bank)

Key | Value
1   | line1
2   | line2
3   | line3
4   | line4
5   | line5

LastName | FirstName | Balance | Address             | City (key*) | Date       | Bank
SCHOBER  | CARL      | 396.27  | 8356 TRANSIT        | TORONTO     | 03/02/1986 | HSBC
SMITH    | ROSE      | 16.13   | 12719 76 ST N       | EDMONTON    | 14/08/1985 | HSBC
ALLAN    | MARGO     | 1926.02 | 1322 MACDONALD      | TORONTO     | 22/07/1982 | CIBC
SIDE     | PAUL      | 116.36  | 148 EDMONTON CTR NW | EDMONTON    | 01/02/1983 | CIBC
RIVER    | ADRIAN    | 6.37    | UNKOWN              | TORONTO     | 9/03/1980  | CIBC
-
From SQL to MapReduce
• Find the Min and Max of Balances for each City and Bank for the year 1983.
• MapReduce
– Input: < LineNr, Line >
– Map: < (City,Bank), Balance >
– Reduce: < (City,Bank), (Min(Balance), Max(Balance)) >
VS
• SQL
– Select min(Balance), max(Balance) from dataset where date LIKE '%1983%' group by (City,Bank)
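For comparison, the same query can also be written as a Pig Latin script, introduced in the following slides. This is only a sketch: the input file name 'accounts', the comma delimiter, and the field types are assumptions, not from the slides.

```
-- Assumed: a comma-delimited file 'accounts' with the bank-account schema
A = LOAD 'accounts' USING PigStorage(',') AS
    (LastName:chararray, FirstName:chararray, Balance:float,
     Address:chararray, City:chararray, Date:chararray, Bank:chararray);
B = FILTER A BY Date MATCHES '.*1983.*';
C = GROUP B BY (City, Bank);
D = FOREACH C GENERATE group, MIN(B.Balance), MAX(B.Balance);
DUMP D;
```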
-
Higher level scripting languages
• Written code or statements are compiled into the code of lower level frameworks
– Such as Hadoop MapReduce or Apache Spark
• Compiled code is deployed in actual computing clusters
• Can significantly reduce the time it takes to create data processing applications
– Typically requires much less code
– Many included user defined functions and libraries
– More input and output formats supported natively
• Much more suitable for prototyping
-
Types of higher level frameworks
• SQL based
– Apache Hive
– Spark SQL
– Impala
• New languages
– Apache Pig - Pig Latin
– Apache DataFu
• APIs for specific distributed data structures
– Apache Spark DataFrames
• List of available frameworks and tools in the Hadoop ecosystem: https://hadoopecosystemtable.github.io/
-
Apache Pig
• A data flow framework on top of Hadoop MapReduce
– Retains most of its advantages
– And some of its disadvantages
• Modeled as a scripting language
– Fast prototyping
• Uses the Pig Latin language
– Similar to declarative SQL
– Easier to get started with
• Pig Latin statements are automatically translated into MapReduce jobs
-
Pig Latin
• Write complex MapReduce transformations using a much simpler scripting language
• Not quite SQL - but similar
• Lazy evaluation
• Compiling is hidden from the user
– Pig Latin is compiled into MapReduce code on the server side
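Lazy evaluation means that statements only build up a logical plan; nothing runs until output is requested. A minimal sketch (the file name 'input.txt' and its schema are assumed for illustration):

```
A = LOAD 'input.txt' AS (f1:int);  -- nothing is executed yet
B = FILTER A BY f1 > 0;            -- still only extending the logical plan
DUMP B;                            -- only now is the plan compiled into
                                   -- MapReduce jobs and executed
```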
-
Pig Data Structures
• Relation
– Similar to a table in a relational database
– Consists of a Bag
– Can have nested relations
• Bag
– Collection of unordered tuples
• Tuple
– An ordered set of fields
– Similar to a row in a relational database
– Can contain any number of fields, does not have to match other tuples
• Field
– A 'piece' of data
-
Pig Latin Example
• A = LOAD 'student' USING PigStorage() AS (name, age, gpa);
• DUMP A; – (John, 18, 4.0F)
– (Mary, 19, 3.8F)
– (Bill, 20, 3.9F)
– (Joe, 18, 3.8F)
• B = GROUP A BY age;
• C = FOREACH B GENERATE AVG(A.gpa);
-
Fields
• Consists of either:
– Data atoms - int, long, float, double, chararray, boolean, datetime, etc.
– Complex data - Bag, Map, Tuple
• Assigning types to fields
– A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
• Referencing fields
– By order - $0, $1, $2
– By name - assigned by user schemas
• A = LOAD 'in.txt' AS (age, name, occupation);
-
Complex data types
• Tuples - (a, b, c)
• Bags - {(a,b), (c,d)}
• Maps - [martin#18, daniel#27]
• Addressing values inside nested data structures
– client.$0
– author.age
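A short sketch of how such addressing looks in a script. The file name 'clients.txt', the schema, and the map key 'language' are hypothetical, chosen only to illustrate the three addressing forms:

```
-- 'clients.txt' is an assumed input file with a nested schema
A = LOAD 'clients.txt' AS (client:tuple(name:chararray, age:int), prefs:map[]);
-- tuple fields by position (client.$0) or by name (client.age),
-- map values by key (prefs#'language')
B = FOREACH A GENERATE client.$0, client.age, prefs#'language';
```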
-
Loading and storing data
• LOAD
– A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
– User defines data loader and delimiters
• STORE
– STORE A INTO 'output_1.txt' USING PigStorage(',');
– STORE B INTO 'output_2.txt' USING PigStorage('*');
• Other data loaders
– BinStorage
– PigDump
– TextLoader
– Or create a custom one
-
Group .. BY
• A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
• DUMP A;
– (John, 18, 4.0F)
– (Mary, 19, 3.8F)
– (Bill, 20, 3.9F)
– (Joe, 18, 3.8F)
• B = GROUP A BY age;
– (18, {(John, 18, 4.0F), (Joe, 18, 3.8F)})
– (19, {(Mary, 19, 3.8F)})
– (20, {(Bill, 20, 3.9F)})
-
FOREACH … GENERATE
• General data transformation statement
• Roughly equivalent to applying Map in MapReduce
• Used to:
– Change the structure of data
– Apply a function to each tuple
– Flatten complex data to remove nesting

D = foreach B generate group as age, AVG(A.gpa);
(18, 3.9F)
(19, 3.8F)
(20, 3.9F)
-
Flattening grouped data
DUMP A;
(John, 18, 4.0F)
(Mary, 19, 3.8F)
(Bill, 20, 3.9F)
(Joe, 18, 3.8F)

B = GROUP A BY age;
(18, {(John, 18, 4.0F), (Joe, 18, 3.8F)})
(19, {(Mary, 19, 3.8F)})
(20, {(Bill, 20, 3.9F)})

D = foreach B generate AVG(A.gpa), A;
(3.9F, {(John, 18, 4.0F), (Joe, 18, 3.8F)})
(3.8F, {(Mary, 19, 3.8F)})
(3.9F, {(Bill, 20, 3.9F)})

D = foreach B generate AVG(A.gpa), flatten(A);
(3.9F, John, 18, 4.0F)
(3.9F, Joe, 18, 3.8F)
(3.8F, Mary, 19, 3.8F)
(3.9F, Bill, 20, 3.9F)
-
JOIN
• A = LOAD 'data1' AS (a1:int, a2:int, a3:int);
• B = LOAD 'data2' AS (b1:int, b2:int);
• X = JOIN A BY a1, B BY b1;
DUMP A;
(1,2,3)
(4,2,1)
DUMP B;
(1,3)
(2,7)
(4,6)
DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
-
Union
• A = LOAD 'data' AS (a1:int, a2:int, a3:int);
• B = LOAD 'data' AS (b1:int, b2:int);
• X = UNION A, B;
DUMP A;
(1,2,3)
(4,2,1)

DUMP B;
(2,4)
(8,9)

DUMP X;
(1,2,3)
(4,2,1)
(2,4)
(8,9)
-
Other functions
• SAMPLE
– A = LOAD 'data' AS (f1:int, f2:int, f3:int);
– X = SAMPLE A 0.01;
– X will contain roughly 1% of the tuples in A
• FILTER
– A = LOAD 'data' AS (a1:int, a2:int, a3:int);
– X = FILTER A BY a3 == 3;
-
Other functions
• DISTINCT - removes duplicate tuples
– X = DISTINCT A;
• LIMIT
– X = LIMIT B 3;
• SPLIT
– SPLIT A INTO X IF f1 < 7, Y IF f1 >= 7;
-
WordCount in Pig
A = load '/tmp/books/books';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into '/user/labuser/pelle_jakovits/out';
• Input and output are HDFS folders or files
– /tmp/books/books
– /user/labuser/pelle_jakovits/out
• A, B, C, D are relations
• The right hand side contains Pig expressions
-
Nested Pig Statements
• Nested Pig Statements can be used to apply more complex data processing statements.
A = LOAD 'Unclaimed_bank_accounts.csv' USING PigStorage(',') AS
    (last_name, first_name, balance, address, city, last_transaction, bank_name);
B = GROUP A BY city;
C = FOREACH B {
    sorted = ORDER A BY last_transaction;
    oldest_balances = LIMIT sorted 5;
    GENERATE group AS city, oldest_balances;
};
-
User Defined Functions (UDF)
• When the Built in Pig functions are not enough
• When we want to modify the behavior of built in functions
• Can load Pig UDF from .jar containers
REGISTER myudfs.jar;
A = load '/tmp/books/books';
B = foreach A generate
flatten(myudfs.MYTOKENIZE((chararray)$0)) as word;
-
Pig UDF
public class MYTOKENIZE extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    public DataBag exec(Tuple input) throws IOException {
        DataBag output = mBagFactory.newDefaultBag();
        String o = (String) input.get(0);
        StringTokenizer tok = new StringTokenizer(o, " \",()*");
        while (tok.hasMoreTokens())
            output.add(mTupleFactory.newTuple(tok.nextToken()));
        return output;
    }
}
-
From Pig Latin to MapReduce
-
Pig Latin conversion workflow
-
Advantages of Pig
• Easy to program
– ~5% of the code, ~5% of the time required
• Two stages of optimizations
– Pig Latin statements
– Generated MapReduce code
• Can manage more complex data flows
– Easy to use and join multiple separate inputs, transformations and outputs
• Extensible
– Can be extended with User Defined Functions (UDF) to provide more functionality
-
Apache Pig disadvantages
• Retains some MapReduce disadvantages
– Slow start-up and clean-up of MapReduce jobs
• It takes time for Hadoop to schedule MR jobs
– Not suitable for interactive analytics (OLAP)
• When results are expected in < 1 sec
– Complex applications may require many UDFs
• Pig loses its simplicity over MapReduce
-
Bonus
TFIDF in Pig
-
TFIDF in Pig
A = load '/tmp/books_small' using PigStorage('\n','-tagPath');
B = foreach A generate $0 as file, flatten(TOKENIZE((chararray)$1)) as word;

-- Job 1
C = group B by (word, file);
D = foreach C generate COUNT(B) as n, group.word, group.file;

-- Job 2
E = group D by file;
F = foreach E generate SUM(D.n) as N, flatten(D);

-- Job 3
G = group F by word;
H = foreach G generate COUNT(F.file) as m, flatten(F);

-- Job 4
R = foreach H generate file, word, ((1.0*n)/N)*LOG(10.0/m) as tfidf;
store R into '/user/labuser/jakovits/pig_out8';
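Reading off the script: n is a word's count within a file, N is the total word count of that file, and m is the number of files containing the word. The final FOREACH therefore computes, per (file, word) pair:

```
tfidf(word, file) = (n / N) * log(10 / m)
```

where the constant 10 appears to be the assumed number of documents in the collection.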
-
Next Week
• Next week's practice session
– Processing data with Apache Pig
• Next lecture:
– In-memory data processing – Apache Spark