Higher level scripting languages for distributed data processing Pelle Jakovits October 4th 2017, Tartu


  • Higher level scripting languages for distributed data processing

    Pelle Jakovits

    October 4th 2017, Tartu

  • Outline

    • Disadvantages of MapReduce

    • Higher level scripting languages

    • Apache Pig framework

    – Pig Latin language

    – Execution flow

    – Advantages

    – Disadvantages

    Pelle Jakovits 2/34

  • You already know MapReduce

    • MapReduce = Map, GroupBy, Sort, Reduce

    • Designed for huge scale data processing

    • Provides:

    – Distributed file system

    – High scalability

    – Automatic parallelization

    – Automatic fault recovery

    • Data is replicated

    • Failed tasks are re-executed on other nodes


  • Is MapReduce sufficient?

    • One of the most used frameworks for large scale data processing

    • However, very often MapReduce is not used directly anymore

    Why not?


  • MapReduce Disadvantages

    • Not suitable for prototyping

    – Need to write low level Java code

    – A lot of custom code is required, even for the simplest tasks

    • Need a lot of expertise to optimize MapReduce code

    • Hard to manage more complex MapReduce job chains


  • Dataset abstraction

    • MapReduce uses a (Key, Value) representation of data

    • Move from the (Key, Value) based data structure to sparse tables with specific schemas

    • (Key, Value) ->
      (Field 1, Field 2, …, Field N) ->
      (LastName, FirstName, Balance, Address, City, Date, Bank)


    Key | Value
    ----|------
    1   | line1
    2   | line2
    3   | line3
    4   | line4
    5   | line5

    LastName | FirstName | Balance | Address             | City (key*) | Date       | Bank
    ---------|-----------|---------|---------------------|-------------|------------|-----
    SCHOBER  | CARL      | 396.27  | 8356 TRANSIT        | TORONTO     | 03/02/1986 | HSBC
    SMITH    | ROSE      | 16.13   | 12719 76 ST N       | EDMONTON    | 14/08/1985 | HSBC
    ALLAN    | MARGO     | 1926.02 | 1322 MACDONALD      | TORONTO     | 22/07/1982 | CIBC
    SIDE     | PAUL      | 116.36  | 148 EDMONTON CTR NW | EDMONTON    | 01/02/1983 | CIBC
    RIVER    | ADRIAN    | 6.37    | UNKNOWN             | TORONTO     | 9/03/1980  | CIBC


  • From SQL to MapReduce

    • Find the Min and Max of Balances for each City and Bank for

    the year 1983.

    • MapReduce

    – Input: < LineNr, Line >

    – Map: < (City,Bank), Balance >

    – Reduce: < (City,Bank), (Min(Balance), Max(Balance)) >

    VS

    • SQL

    – SELECT MIN(Balance), MAX(Balance) FROM dataset WHERE Date LIKE '%1983%' GROUP BY City, Bank
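The Map and Reduce stages above can be sketched in plain Python. This is only an illustration of the data flow, not Hadoop code; the sample records are hypothetical rows in the shape of the bank table shown earlier.

```python
from collections import defaultdict

# Hypothetical sample rows in the shape of the earlier bank table:
# (LastName, FirstName, Balance, Address, City, Date, Bank)
records = [
    ("SIDE", "PAUL", 116.36, "148 EDMONTON CTR NW", "EDMONTON", "01/02/1983", "CIBC"),
    ("DOE", "JANE", 50.00, "UNKNOWN", "EDMONTON", "15/06/1983", "CIBC"),
    ("ROE", "RICK", 9.99, "UNKNOWN", "EDMONTON", "20/09/1984", "CIBC"),
]

def map_phase(records):
    """Map: keep only 1983 rows, emit ((City, Bank), Balance) pairs."""
    for last, first, balance, addr, city, date, bank in records:
        if "1983" in date:
            yield (city, bank), balance

def reduce_phase(pairs):
    """Reduce: for each (City, Bank) key, compute (min, max) of the balances."""
    groups = defaultdict(list)
    for key, balance in pairs:  # stands in for the framework's GroupBy/Sort step
        groups[key].append(balance)
    return {key: (min(vals), max(vals)) for key, vals in groups.items()}

result = reduce_phase(map_phase(records))
```

Note how the grouping between Map and Reduce is done by the framework in real MapReduce; here it is simulated with a dictionary.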


  • Higher level scripting languages

    • Written code or statements are compiled into code for lower level frameworks

    – Such as Hadoop MapReduce or Apache Spark

    • Compiled code is deployed in actual computing clusters

    • Can significantly reduce the time it takes to create data processing applications

    – Typically requires much less code

    – Many included user defined functions and libraries

    – More input and output formats supported natively

    • Much more suitable for prototyping


  • Types of higher level frameworks

    • SQL based

    – Apache Hive

    – Spark SQL

    – Impala

    • New languages

    – Apache Pig - Pig Latin

    – Apache DataFu

    • APIs for specific distributed data structures

    – Apache Spark DataFrames

    • List of available frameworks and tools in the Hadoop ecosystem: https://hadoopecosystemtable.github.io/


  • Apache Pig

    • A data flow framework on top of Hadoop MapReduce

    – Retains most of its advantages

    – And some of its disadvantages

    • Models a scripting language

    – Fast prototyping

    • Uses Pig Latin language

    – Similar to declarative SQL

    – Easier to get started with

    • Pig Latin statements are automatically translated into MapReduce jobs


  • Pig Latin

    • Write complex MapReduce transformations using a much simpler scripting language

    • Not quite SQL - but similar

    • Lazy evaluation

    • Compiling is hidden from the user

    – Pig Latin is compiled into MapReduce code on the server side


  • Pig Data Structures

    • Relation

    – Similar to a table in a relational database

    – Consists of a Bag

    – Can have nested relations

    • Bag

    – A collection of unordered tuples

    • Tuple

    – An ordered set of fields

    – Similar to a row in a relational database

    – Can contain any number of fields; does not have to match other tuples

    • Field

    – A 'piece' of data


  • Pig Latin Example

    • A = LOAD 'student' USING PigStorage() AS (name, age, gpa);

    • DUMP A;

    – (John, 18, 4.0F)

    – (Mary, 19, 3.8F)

    – (Bill, 20, 3.9F)

    – (Joe, 18, 3.8F)

    • B = GROUP A BY age;

    • C = FOREACH B GENERATE AVG(A.gpa);


  • Fields

    • A field consists of either:

    – Data atoms - int, long, float, double, chararray, boolean, datetime, etc.

    – Complex data - Bag, Map, Tuple

    • Assigning types to fields

    – A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

    • Referencing fields

    – By order - $0, $1, $2

    – By name - assigned by user schemas

    • A = LOAD 'in.txt' AS (age, name, occupation);


  • Complex data types

    • Tuples - (a, b, c)

    • Bags - {(a,b), (c,d)}

    • Maps - [martin#18, daniel#27]

    • Addressing values inside nested data structures

    – client.$0

    – author.age


  • Loading and storing data

    • LOAD

    – A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);

    – User defines the data loader and delimiters

    • STORE

    – STORE A INTO 'output_1.txt' USING PigStorage(',');

    – STORE B INTO 'output_2.txt' USING PigStorage('*');

    • Other data loaders

    – BinStorage

    – PigDump

    – TextLoader

    – Or create a custom one


  • Group .. BY

    • A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

    • DUMP A;

    – (John, 18, 4.0F)

    – (Mary, 19, 3.8F)

    – (Bill, 20, 3.9F)

    – (Joe, 18, 3.8F)

    • B = GROUP A BY age;

    – (18, {(John, 18, 4.0F), (Joe, 18, 3.8F)})

    – (19, {(Mary, 19, 3.8F)})

    – (20, {(Bill, 20, 3.9F)})


  • FOREACH … GENERATE

    • General data transformation statement

    • Roughly equivalent to applying a Map in MapReduce

    • Used to:

    – Change the structure of data

    – Apply a function to each tuple

    – Flatten complex data to remove nesting

    D = foreach B generate group, AVG(A.gpa);

    (18, 3.9F)
    (19, 3.8F)
    (20, 3.9F)


  • Flattening grouped data


    DUMP A;
    (John, 18, 4.0F)
    (Mary, 19, 3.8F)
    (Bill, 20, 3.9F)
    (Joe, 18, 3.8F)

    B = GROUP A BY age;
    (18, {(John, 18, 4.0F), (Joe, 18, 3.8F)})
    (19, {(Mary, 19, 3.8F)})
    (20, {(Bill, 20, 3.9F)})

    D = foreach B generate AVG(A.gpa), A;
    (3.9F, {(John, 18, 4.0F), (Joe, 18, 3.8F)})
    (3.8F, {(Mary, 19, 3.8F)})
    (3.9F, {(Bill, 20, 3.9F)})

    D = foreach B generate AVG(A.gpa), flatten(A);
    (3.9F, John, 18, 4.0F)
    (3.9F, Joe, 18, 3.8F)
    (3.8F, Mary, 19, 3.8F)
    (3.9F, Bill, 20, 3.9F)
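The effect of flatten can be sketched in plain Python: without it each group stays one tuple holding a nested bag, with it the bag is unrolled into one output tuple per member (a semantic illustration only, using the same student data):

```python
# Grouped relation B: age -> bag of (name, age, gpa) tuples
B = {
    18: [("John", 18, 4.0), ("Joe", 18, 3.8)],
    19: [("Mary", 19, 3.8)],
    20: [("Bill", 20, 3.9)],
}

def avg_gpa(bag):
    return sum(t[2] for t in bag) / len(bag)

# Without flatten: one tuple per group, the nested bag is kept as a value
nested = [(avg_gpa(bag), bag) for bag in B.values()]

# With flatten: the bag is unrolled, repeating the group's average
# in front of every member tuple
flattened = [(avg_gpa(bag),) + t for bag in B.values() for t in bag]
```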


  • JOIN

    • A = LOAD 'data1' AS (a1:int, a2:int, a3:int);

    • B = LOAD 'data2' AS (b1:int, b2:int);

    • X = JOIN A BY a1, B BY b1;

    DUMP A;

    (1,2,3)

    (4,2,1)

    DUMP B;

    (1,3)

    (2,7)

    (4,6)

    DUMP X;

    (1,2,3,1,3)

    (4,2,1,4,6)
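The JOIN above is an inner join on the first field, and the output tuple is the concatenation of the matching A and B tuples. A minimal Python sketch of that semantics, on the same data:

```python
A = [(1, 2, 3), (4, 2, 1)]
B = [(1, 3), (2, 7), (4, 6)]

# X = JOIN A BY a1, B BY b1;
# keep only pairs whose join keys match, concatenating the tuples
X = [a + b for a in A for b in B if a[0] == b[0]]
```

(The tuple (2, 7) of B has no match in A, so it is dropped, as on the slide.)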


  • Union

    • A = LOAD 'data1' AS (a1:int, a2:int, a3:int);

    • B = LOAD 'data2' AS (b1:int, b2:int);

    • X = UNION A, B;

    DUMP A; (1,2,3)

    (4,2,1)

    DUMP B; (2,4)

    (8,9)

    DUMP X; (1,2,3)

    (4,2,1)

    (2,4)

    (8,9)


  • Other functions

    • SAMPLE

    – A = LOAD 'data' AS (f1:int, f2:int, f3:int);

    – X = SAMPLE A 0.01;

    – X will contain ~1% of the tuples in A

    • FILTER

    – A = LOAD 'data' AS (a1:int, a2:int, a3:int);

    – X = FILTER A BY a3 == 3;
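SAMPLE keeps each tuple independently with the given probability (so the result size is only approximately 1%), while FILTER keeps exactly the tuples matching a predicate. A Python sketch of both, on hypothetical generated data:

```python
import random

# Hypothetical relation: 1000 tuples of (f1, f2, f3)
A = [(i, i % 5, i % 3) for i in range(1000)]

# X = SAMPLE A 0.01;  -> each tuple kept with probability 0.01
random.seed(42)
sampled = [t for t in A if random.random() < 0.01]

# X = FILTER A BY f3 == 2;  -> deterministic predicate on the third field
filtered = [t for t in A if t[2] == 2]
```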


  • Other functions

    • DISTINCT – removes duplicate tuples

    – X = DISTINCT A;

    • LIMIT

    – X = LIMIT B 3;

    • SPLIT

    – SPLIT A INTO X IF f1 …

  • WordCount in Pig

    A = load '/tmp/books/books';

    B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

    C = group B by word;

    D = foreach C generate COUNT(B), group;

    store D into '/user/labuser/pelle_jakovits/out';

    • Input and output are HDFS folders or files

    – /tmp/books/books

    – /user/labuser/pelle_jakovits/out

    • A, B, C, D are relations

    • Right hand side contains Pig expressions
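The WordCount script above tokenizes lines, groups by word, and counts each group. The same pipeline can be sketched in plain Python (a semantic illustration; the input lines are hypothetical stand-ins for the HDFS files):

```python
from collections import Counter

# Hypothetical lines standing in for the contents of '/tmp/books/books'
lines = ["the quick brown fox", "the lazy dog", "the fox"]

# B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
words = [w for line in lines for w in line.split()]

# C = group B by word;
# D = foreach C generate COUNT(B), group;
counts = Counter(words)
```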


  • Nested Pig Statements

    • Nested Pig Statements can be used to apply more complex data processing statements.

    A = LOAD 'Unclaimed_bank_accounts.csv' USING PigStorage(',') AS
        (last_name, first_name, balance, address, city, last_transaction, bank_name);

    B = GROUP A BY city;

    C = foreach B {
        sorted = order A BY last_transaction;
        oldest_balances = limit sorted 5;
        GENERATE group AS city, oldest_balances;
    };
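The nested foreach above computes, per city, the five accounts with the oldest transaction dates. A Python sketch of the same order-then-limit-per-group logic (hypothetical rows; ISO dates are used so that string order equals chronological order):

```python
from collections import defaultdict

# Hypothetical rows: (city, last_transaction, balance)
rows = [
    ("TORONTO", "1986-02-03", 396.27),
    ("TORONTO", "1982-07-22", 1926.02),
    ("TORONTO", "1980-03-09", 6.37),
    ("EDMONTON", "1985-08-14", 16.13),
    ("EDMONTON", "1983-02-01", 116.36),
]

# B = GROUP A BY city;
groups = defaultdict(list)
for row in rows:
    groups[row[0]].append(row)

# C = foreach B { sorted = order A BY last_transaction;
#                 oldest_balances = limit sorted 5; ... }
oldest = {
    city: sorted(bag, key=lambda r: r[1])[:5]
    for city, bag in groups.items()
}
```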


  • User Defined Functions (UDF)

    • When the built-in Pig functions are not enough

    • When we want to modify the behavior of built-in functions

    • Pig UDFs can be loaded from .jar files

    REGISTER myudfs.jar;

    A = load '/tmp/books/books';

    B = foreach A generate

    flatten(myudfs.TOKENIZE((chararray)$0)) as word;


  • Pig UDF

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.*;

    public class MYTOKENIZE extends EvalFunc<DataBag> {

        TupleFactory mTupleFactory = TupleFactory.getInstance();
        BagFactory mBagFactory = BagFactory.getInstance();

        public DataBag exec(Tuple input) throws IOException {
            DataBag output = mBagFactory.newDefaultBag();
            String o = (String) input.get(0);
            StringTokenizer tok = new StringTokenizer(o, " \",()*");
            while (tok.hasMoreTokens())
                output.add(mTupleFactory.newTuple(tok.nextToken()));
            return output;
        }
    }


  • From Pig Latin to MapReduce


  • Pig Latin conversion workflow


  • Advantages of Pig

    • Easy to program

    – ~5% of the code, ~5% of the time required

    • Two stages of optimizations

    – Pig Latin statements

    – Generated MapReduce code

    • Can manage more complex data flows

    – Easy to use and join multiple separate inputs, transformations and outputs

    • Extensible

    – Can be extended with User Defined Functions (UDFs) to provide more functionality


  • Apache Pig disadvantages

    • Retains some MapReduce disadvantages

    – Slow start-up and clean-up of MapReduce jobs

    • It takes time for Hadoop to schedule MR jobs

    – Not suitable for interactive analytics (OLAP)

    • When results are expected in < 1 sec

    – Complex applications may require many UDFs

    • Pig loses its simplicity advantage over MapReduce


  • Bonus

    TFIDF in Pig


  • TFIDF in Pig

    A = load '/tmp/books_small' using PigStorage('\n', '-tagPath');
    B = foreach A generate $0 as file, flatten(TOKENIZE((chararray)$1)) as word;

    -- Job 1
    C = group B by (word, file);
    D = foreach C generate COUNT(B) as n, group.word, group.file;

    -- Job 2
    E = group D by file;
    F = foreach E generate SUM(D.n) as N, flatten(D);

    -- Job 3
    G = group F by word;
    H = foreach G generate COUNT(F.file) as m, flatten(F);

    -- Job 4
    R = foreach H generate file, word, ((1.0*n)/N)*LOG(10.0/m) as tfidf;
    store R into '/user/labuser/jakovits/pig_out8';
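The four jobs above compute, per (file, word): n = the word's count in the file, N = total words in the file, m = number of files containing the word, then tfidf = (n/N) * log(D/m). A Python sketch of the same computation on a hypothetical two-document corpus; D = 10.0 mirrors the hard-coded 10.0 in the script, and natural log is assumed (Pig's LOG is base e):

```python
from collections import Counter
from math import log

# Hypothetical tiny corpus standing in for '/tmp/books_small'
docs = {
    "book1": "to be or not to be",
    "book2": "not a fox",
}
D = 10.0  # the Pig script above hard-codes 10.0 as the corpus size

# Job 1: n = term count per (file, word); Job 2: N = total words per file
n = {(f, w): c for f, text in docs.items() for w, c in Counter(text.split()).items()}
N = {f: len(text.split()) for f, text in docs.items()}

# Job 3: m = number of files containing each word
m = Counter(w for (_, w) in n)

# Job 4: tfidf = (n / N) * log(D / m)
tfidf = {(f, w): (c / N[f]) * log(D / m[w]) for (f, w), c in n.items()}
```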


  • Next Week

    • Next week's practice session

    – Processing data with Apache Pig

    • Next lecture:

    – In-memory data processing – Apache Spark
