-
Higher level scripting languages for distributed data processing
Pelle Jakovits
October 4th 2017, Tartu
-
Outline
• Disadvantages of MapReduce
• Higher level scripting languages
• Apache Pig framework
– Pig Latin language
– Execution flow
– Advantages
– Disadvantages
Pelle Jakovits 2/34
-
You already know MapReduce
• MapReduce = Map, GroupBy, Sort, Reduce
• Designed for huge scale data processing
• Provides:
– Distributed file system
– High scalability
– Automatic parallelization
– Automatic fault recovery
• Data is replicated
• Failed tasks are re-executed on other nodes
-
Is MapReduce sufficient?
• One of the most used frameworks for large scale data processing
• However, very often MapReduce is not used directly anymore
Why not?
-
MapReduce Disadvantages
• Not suitable for prototyping
– Need to write low-level Java code
– A lot of custom code is required, even for the simplest tasks
• A lot of expertise is needed to optimize MapReduce code
• Hard to manage more complex MapReduce job chains
-
Dataset abstraction
• MapReduce uses a (Key, Value) representation of data
• Move from the Key and Value based data structure to sparse tables with specific schemas
• (Key, Value) ->
  (Field 1, Field 2, …, Field N) ->
  (LastName, FirstName, Balance, Address, City, Date, Bank)

Key | Value
1   | line1
2   | line2
3   | line3
4   | line4
5   | line5

LastName | FirstName | Balance | Address             | City (key*) | Date       | Bank
SCHOBER  | CARL      | 396.27  | 8356 TRANSIT        | TORONTO     | 03/02/1986 | HSBC
SMITH    | ROSE      | 16.13   | 12719 76 ST N       | EDMONTON    | 14/08/1985 | HSBC
ALLAN    | MARGO     | 1926.02 | 1322 MACDONALD      | TORONTO     | 22/07/1982 | CIBC
SIDE     | PAUL      | 116.36  | 148 EDMONTON CTR NW | EDMONTON    | 01/02/1983 | CIBC
RIVER    | ADRIAN    | 6.37    | UNKOWN              | TORONTO     | 9/03/1980  | CIBC
-
From SQL to MapReduce
• Find the Min and Max of Balances for each City and Bank for the year 1983.
• MapReduce
– Input: < LineNr, Line >
– Map: < (City,Bank), Balance >
– Reduce: < (City,Bank), (Min(Balance), Max(Balance)) >
VS
• SQL
– Select min(Balance), max(Balance) from dataset where date LIKE '%1983%' group by (City,Bank)
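For comparison, the same query can also be written as a Pig Latin script, introduced in the following slides. This is only a sketch: the input file name 'accounts', the comma delimiter, and the field types are assumptions, not from the slides.

```
-- Assumed: a comma-delimited file 'accounts' with the bank-account schema
A = LOAD 'accounts' USING PigStorage(',') AS
    (LastName:chararray, FirstName:chararray, Balance:float,
     Address:chararray, City:chararray, Date:chararray, Bank:chararray);
B = FILTER A BY Date MATCHES '.*1983.*';
C = GROUP B BY (City, Bank);
D = FOREACH C GENERATE group, MIN(B.Balance), MAX(B.Balance);
DUMP D;
```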
-
Higher level scripting languages
• Written code or statements are compiled into the code of lower level frameworks
– Such as Hadoop MapReduce or Apache Spark
• Compiled code is deployed in actual computing clusters
• Can significantly reduce the time it takes to create data processing applications
– Typically requires much less code
– Many included user defined functions and libraries
– More input and output formats supported natively
• Much more suitable for prototyping
-
Types of higher level frameworks
• SQL based
– Apache Hive
– Spark SQL
– Impala
• New languages
– Apache Pig - Pig Latin
– Apache DataFu
• APIs for specific distributed data structures
– Apache Spark DataFrames
• List of available frameworks and tools in the Hadoop ecosystem: https://hadoopecosystemtable.github.io/
-
Apache Pig
• A data flow framework on top of Hadoop MapReduce
– Retains most of its advantages
– And some of its disadvantages
• Modeled as a scripting language
– Fast prototyping
• Uses the Pig Latin language
– Similar to declarative SQL
– Easier to get started with
• Pig Latin statements are automatically translated into MapReduce jobs
-
Pig Latin
• Write complex MapReduce transformations using a much simpler scripting language
• Not quite SQL - but similar
• Lazy evaluation
• Compiling is hidden from the user
– Pig Latin is compiled into MapReduce code on the server side
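Lazy evaluation means that statements only build up a logical plan; nothing runs until output is requested. A minimal sketch (the file name 'input.txt' and its schema are assumed for illustration):

```
A = LOAD 'input.txt' AS (f1:int);  -- nothing is executed yet
B = FILTER A BY f1 > 0;            -- still only extending the logical plan
DUMP B;                            -- only now is the plan compiled into
                                   -- MapReduce jobs and executed
```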
-
Pig Data Structures
• Relation
– Similar to a table in a relational database
– Consists of a Bag
– Can have nested relations
• Bag
– Collection of unordered tuples
• Tuple
– An ordered set of fields
– Similar to a row in a relational database
– Can contain any number of fields, does not have to match other tuples
• Field
– A 'piece' of data
-
Pig Latin Example
• A = LOAD 'student' USING PigStorage() AS (name, age, gpa);
• DUMP A; – (John, 18, 4.0F)
– (Mary, 19, 3.8F)
– (Bill, 20, 3.9F)
– (Joe, 18, 3.8F)
• B = GROUP A BY age;
• C = FOREACH B GENERATE AVG(A.gpa);
-
Fields
• Consists of either:
– Data atoms - int, long, float, double, chararray, boolean, datetime, etc.
– Complex data - Bag, Map, Tuple
• Assigning types to fields
– A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
• Referencing fields
– By order - $0, $1, $2
– By name - assigned by user schemas
• A = LOAD 'in.txt' AS (age, name, occupation);
-
Complex data types
• Tuples - (a, b, c)
• Bags - {(a,b), (c,d)}
• Maps - [martin#18, daniel#27]
• Addressing values inside nested data structures
– client.$0
– author.age
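A short sketch of how such addressing looks in a script. The file name 'clients.txt', the schema, and the map key 'language' are hypothetical, chosen only to illustrate the three addressing forms:

```
-- 'clients.txt' is an assumed input file with a nested schema
A = LOAD 'clients.txt' AS (client:tuple(name:chararray, age:int), prefs:map[]);
-- tuple fields by position (client.$0) or by name (client.age),
-- map values by key (prefs#'language')
B = FOREACH A GENERATE client.$0, client.age, prefs#'language';
```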
-
Loading and storing data
• LOAD
– A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
– User defines data loader and delimiters
• STORE
– STORE A INTO 'output_1.txt' USING PigStorage(',');
– STORE B INTO 'output_2.txt' USING PigStorage('*');
• Other data loaders
– BinStorage
– PigDump
– TextLoader
– Or create a custom one
-
Group .. BY
• A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
• DUMP A;
– (John, 18, 4.0F)
– (Mary, 19, 3.8F)
– (Bill, 20, 3.9F)
– (Joe, 18, 3.8F)
• B = GROUP A BY age;
– (18, {(John, 18, 4.0F), (Joe, 18, 3.8F)})
– (19, {(Mary, 19, 3.8F)})
– (20, {(Bill, 20, 3.9F)})
-
FOREACH … GENERATE
• General data transformation statement
• Roughly equivalent to applying Map in MapReduce
• Used to:
– Change the structure of data
– Apply a function to each tuple
– Flatten complex data to remove nesting

D = foreach B generate group as age, AVG(A.gpa);
(18, 3.9F)
(19, 3.8F)
(20, 3.9F)
-
Flattening grouped data
DUMP A;
(John, 18, 4.0F)
(Mary, 19, 3.8F)
(Bill, 20, 3.9F)
(Joe, 18, 3.8F)

B = GROUP A BY age;
(18, {(John, 18, 4.0F), (Joe, 18, 3.8F)})
(19, {(Mary, 19, 3.8F)})
(20, {(Bill, 20, 3.9F)})

D = foreach B generate AVG(A.gpa), A;
(3.9F, {(John, 18, 4.0F), (Joe, 18, 3.8F)})
(3.8F, {(Mary, 19, 3.8F)})
(3.9F, {(Bill, 20, 3.9F)})

D = foreach B generate AVG(A.gpa), flatten(A);
(3.9F, John, 18, 4.0F)
(3.9F, Joe, 18, 3.8F)
(3.8F, Mary, 19, 3.8F)
(3.9F, Bill, 20, 3.9F)
-
JOIN
• A = LOAD 'data1' AS (a1:int, a2:int, a3:int);
• B = LOAD 'data2' AS (b1:int, b2:int);
• X = JOIN A BY a1, B BY b1;
DUMP A;
(1,2,3)
(4,2,1)
DUMP B;
(1,3)
(2,7)
(4,6)
DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
-
Union
• A = LOAD 'data' AS (a1:int, a2:int, a3:int);
• B = LOAD 'data' AS (b1:int, b2:int);
• X = UNION A, B;
DUMP A;
(1,2,3)
(4,2,1)

DUMP B;
(2,4)
(8,9)

DUMP X;
(1,2,3)
(4,2,1)
(2,4)
(8,9)
-
Other functions
• SAMPLE
– A = LOAD 'data' AS (f1:int, f2:int, f3:int);
– X = SAMPLE A 0.01;
– X will contain roughly 1% of the tuples in A
• FILTER
– A = LOAD 'data' AS (a1:int, a2:int, a3:int);
– X = FILTER A BY a3 == 3;
-
Other functions
• DISTINCT - removes duplicate tuples
– X = DISTINCT A;
• LIMIT
– X = LIMIT B 3;
• SPLIT
– SPLIT A INTO X IF f1 < 7, Y IF f1 >= 7;
-
WordCount in Pig
A = load '/tmp/books/books';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into '/user/labuser/pelle_jakovits/out';
• Input and output are HDFS folders or files
– /tmp/books/books
– /user/labuser/pelle_jakovits/out
• A, B, C, D are relations
• The right hand side contains Pig expressions
-
Nested Pig Statements
• Nested Pig Statements can be used to apply more complex data processing statements.
A = LOAD 'Unclaimed_bank_accounts.csv' USING PigStorage(',') AS
    (last_name, first_name, balance, address, city, last_transaction, bank_name);
B = GROUP A BY city;
C = FOREACH B {
    sorted = ORDER A BY last_transaction;
    oldest_balances = LIMIT sorted 5;
    GENERATE group AS city, oldest_balances;
};
-
User Defined Functions (UDF)
• When the Built in Pig functions are not enough
• When we want to modify the behavior of built in functions
• Can load Pig UDF from .jar containers
REGISTER myudfs.jar;
A = load '/tmp/books/books';
B = foreach A generate
flatten(myudfs.MYTOKENIZE((chararray)$0)) as word;
-
Pig UDF
public class MYTOKENIZE extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    public DataBag exec(Tuple input) throws IOException {
        DataBag output = mBagFactory.newDefaultBag();
        String o = (String) input.get(0);
        StringTokenizer tok = new StringTokenizer(o, " \",()*");
        while (tok.hasMoreTokens())
            output.add(mTupleFactory.newTuple(tok.nextToken()));
        return output;
    }
}
-
From Pig Latin to MapReduce
-
Pig Latin conversion workflow
-
Advantages of Pig
• Easy to program
– ~5% of the code, ~5% of the time required
• Two stages of optimizations
– Pig Latin statements
– Generated MapReduce code
• Can manage more complex data flows
– Easy to use and join multiple separate inputs, transformations and outputs
• Extensible
– Can be extended with User Defined Functions (UDF) to provide more functionality
-
Apache Pig disadvantages
• Retains some MapReduce disadvantages
– Slow start-up and clean-up of MapReduce jobs
• It takes time for Hadoop to schedule MR jobs
– Not suitable for interactive analytics (OLAP)
• When results are expected in < 1 sec
– Complex applications may require many UDFs
• Pig loses its simplicity over MapReduce
-
Bonus
TFIDF in Pig
-
TFIDF in Pig
A = load '/tmp/books_small' using PigStorage('\n','-tagPath');
B = foreach A generate $0 as file, flatten(TOKENIZE((chararray)$1)) as word;

-- Job 1
C = group B by (word, file);
D = foreach C generate COUNT(B) as n, group.word, group.file;

-- Job 2
E = group D by file;
F = foreach E generate SUM(D.n) as N, flatten(D);

-- Job 3
G = group F by word;
H = foreach G generate COUNT(F.file) as m, flatten(F);

-- Job 4
R = foreach H generate file, word, ((1.0*n)/N)*LOG(10.0/m) as tfidf;
store R into '/user/labuser/jakovits/pig_out8';
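Reading off the script: n is a word's count within a file, N is the total word count of that file, and m is the number of files containing the word. The final FOREACH therefore computes, per (file, word) pair:

```
tfidf(word, file) = (n / N) * log(10 / m)
```

where the constant 10 appears to be the assumed number of documents in the collection.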
-
Next Week
• Next week's practice session
– Processing data with Apache Pig
• Next lecture:
– In-memory data processing – Apache Spark