Making Pig Fly

Upload: nagybaly

Post on 02-Jun-2018


  • 8/10/2019 Making Pig Fly

    1/36

    Hortonworks Inc. 2011

    Daniel Dai (@daijy), Thejas Nair (@thejasn)

    Page 1

    Making Pig Fly: Optimizing Data Processing on Hadoop

    What is Apache Pig?

    Pig Latin, a high-level data processing language.

    An engine that executes Pig Latin locally or on a Hadoop cluster.

    Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

    Pig Latin example

    Query: Get the list of web pages visited by users whose age is between 20 and 29 years.

    USERS = load 'users' as (uid, age);

    USERS_20s = filter USERS by age >= 20 and age

    Why Pig?

    Faster development
        Fewer lines of code
        Don't re-invent the wheel

    Flexible
        Metadata is optional
        Extensible
        Procedural programming

    Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

    Pig optimizations

    Ideally the user should not have to bother
    Reality
        Pig is still young and immature
        Pig does not have the whole picture
            Cluster configuration
            Data histogram
        Pig philosophy: Pig is docile

    Pig optimizations

    What Pig does for you
        Does safe transformations of the query to optimize it
        Optimized operations (join, sort)

    What you do
        Organize input in an optimal way
        Optimize the Pig Latin query
        Tell Pig which join/group algorithm to use

    Rule based optimizer

    Column pruner
    Push up filter
    Push down flatten
    Push up limit
    Partition pruning
    Global optimizer

    Column Pruner

    Pig will do column pruning automatically

    A = load 'input' as (a0, a1, a2);
    B = foreach A generate a0+a1;
    C = order B by $0;
    Store C into 'output';

    Pig will prune a2 automatically

    Cases where Pig will not do column pruning automatically

    No schema specified in load statement

    A = load 'input';
    B = order A by $0;
    C = foreach B generate $0+$1;
    Store C into 'output';

    DIY

    A = load 'input';
    A1 = foreach A generate $0, $1;
    B = order A1 by $0;
    C = foreach B generate $0+$1;
    Store C into 'output';

    Column Pruner

    Another case where Pig does not do column pruning
    Pig does not keep track of unused columns after grouping

    A = load 'input' as (a0, a1, a2);
    B = group A by a0;
    C = foreach B generate SUM(A.a1);
    Store C into 'output';

    DIY

    A = load 'input' as (a0, a1, a2);
    A1 = foreach A generate $0, $1;
    B = group A1 by a0;
    C = foreach B generate SUM(A1.a1);
    Store C into 'output';
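
A minimal Python sketch of why the DIY projection helps, assuming hypothetical three-field records in place of the real input. It does not reproduce how Pig prunes columns; it only shows that dropping a2 before the group leaves the sum unchanged while fewer fields flow into the grouping step.

```python
# Hypothetical rows standing in for the (a0, a1, a2) input.
rows = [("x", 1, "unused-1"), ("y", 2, "unused-2"), ("x", 3, "unused-3")]


def group_sum(records):
    """Group by field 0 and sum field 1, like `group ... by a0` plus SUM(a1)."""
    totals = {}
    for rec in records:
        totals[rec[0]] = totals.get(rec[0], 0) + rec[1]
    return totals


# Without pruning: every record carries a2 into the grouping step.
unpruned = group_sum(rows)

# DIY pruning: project ($0, $1) first, so a2 never reaches the group.
projected = [(r[0], r[1]) for r in rows]
pruned = group_sum(projected)

# Count the fields that would have to be moved in each case.
fields_before = sum(len(r) for r in rows)
fields_after = sum(len(r) for r in projected)
```

Both variants give the same totals; only the amount of data carried into the group differs.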

    Push up filter

    Pig splits the filter condition before pushing it up

    Original query: Join of A and B, followed by Filter (a0>0 && b0>10)
    Split filter condition: Join of A and B, followed by Filter a0>0 and Filter b0>10
    Push up filter: Filter a0>0 on A and Filter b0>10 on B, followed by Join
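
The rewrite can be checked on a small example. The Python sketch below is illustrative only: the relations A and B, the join key `k`, and the helper `join` are hypothetical, and it assumes an inner join, where filtering each input first gives the same rows as filtering after the join.

```python
# Hypothetical inputs: a0 lives in A, b0 lives in B, both share the key k.
A = [{"k": 1, "a0": 5}, {"k": 2, "a0": -1}, {"k": 3, "a0": 7}]
B = [{"k": 1, "b0": 20}, {"k": 2, "b0": 30}, {"k": 3, "b0": 4}]


def join(left, right, key="k"):
    """Inner hash join of two lists of dicts on `key`."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Original plan: join everything, then apply the combined condition.
original = [t for t in join(A, B) if t["a0"] > 0 and t["b0"] > 10]

# Rewritten plan: split the condition and filter each input before the join.
A_filtered = [t for t in A if t["a0"] > 0]
B_filtered = [t for t in B if t["b0"] > 10]
pushed = join(A_filtered, B_filtered)
```

The join in the rewritten plan sees two rows per side instead of three, and the result is the same.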

    Other push up/down

    Push down flatten
    Push up limit

    Diagram: a Load, Flatten, Order plan, shown twice

    Partition pruning

    Prune unnecessary partitions entirely
    HCatLoader

    Diagram: partitions 2010, 2011, 2012
        Before: HCatLoader, followed by Filter (year>=2011)
        After: HCatLoader (year>=2011)
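
A small Python sketch of the idea, under the assumption that data is stored in one hypothetical bucket per year. It does not use or imitate the HCatLoader API; the function names are made up for illustration. The point is that a condition on the partition key can decide which partitions are read at all.

```python
# Hypothetical table partitioned by year; each partition holds its own rows.
partitions = {
    2010: [{"year": 2010, "v": 1}],
    2011: [{"year": 2011, "v": 2}],
    2012: [{"year": 2012, "v": 3}],
}


def load_then_filter(parts, min_year):
    """Read every partition, then filter rows on year."""
    scanned, rows = [], []
    for year, records in parts.items():
        scanned.append(year)
        rows.extend(r for r in records if r["year"] >= min_year)
    return rows, scanned


def load_with_pruning(parts, min_year):
    """Use the partition key to skip partitions that cannot match."""
    scanned = [year for year in parts if year >= min_year]
    rows = [r for year in scanned for r in parts[year]]
    return rows, scanned


rows_a, scanned_a = load_then_filter(partitions, 2011)
rows_b, scanned_b = load_with_pruning(partitions, 2011)
```

Both loaders return the same rows, but the pruning variant never touches the 2010 partition.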

    Intermediate file compression

    Diagram: a Pig script runs as a chain of jobs (map 1/reduce 1, map 2/reduce 2, map 3/reduce 3), with Pig temp files between jobs

    Enable temp file compression

    Pig temp files are not compressed by default
        Issues with Snappy (HADOOP-7990)
        LZO: not Apache license

    Enable LZO compression
        Install LZO for Hadoop
        In conf/pig.properties

        pig.tmpfilecompression = true
        pig.tmpfilecompression.codec = lzo

    With LZO, up to > 90% disk saving and 4x query speed up

    Multiquery

    Combine two or more map/reduce jobs into one
        Happens automatically
        Cases where we want to control multiquery: it combines too many

    Diagram: one Load feeding three branches (Group by $0, Group by $1, Group by $2), each followed by Foreach and Store


    Implement the right UDF

    Algebraic UDF
        Initial
        Intermediate
        Final

    A = load 'input';
    B0 = group A by $0;
    C0 = foreach B0 generate group, SUM(A);
    Store C0 into 'output0';

    Map: Initial
    Combiner: Intermediate
    Reduce: Final
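
The three stages can be sketched in plain Python. This is an illustration of the algebraic pattern, not Pig's UDF interface: it computes an average rather than SUM so that the stages visibly differ, and the function names and the two map tasks are assumptions made for the example.

```python
def initial(value):
    """Map side: turn one input value into a partial (sum, count)."""
    return (value, 1)


def intermediate(partials):
    """Combiner: merge partials into one partial; may run zero or more times."""
    partials = list(partials)
    return (sum(p[0] for p in partials), sum(p[1] for p in partials))


def final(partials):
    """Reduce side: merge the remaining partials and produce the result."""
    total, count = intermediate(partials)
    return total / count


# Two hypothetical map tasks holding values for the same group key.
map_task_1 = [1, 2, 3]
map_task_2 = [4, 5]

combined_1 = intermediate(initial(v) for v in map_task_1)
combined_2 = intermediate(initial(v) for v in map_task_2)
average = final([combined_1, combined_2])

# The same answer must come out if the combiner never runs.
average_no_combiner = final(initial(v) for v in map_task_1 + map_task_2)
```

Because partials can be merged in any grouping, the reducer receives one small tuple per map task instead of every raw value.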

    Implement the right UDF

    Accumulator UDF
        Reduce side UDF
        Normally takes a bag

    Benefit
        Big bags are passed in batches
        Avoid using too much memory

    A = load 'input';
    B0 = group A by $0;
    C0 = foreach B0 generate group, my_accum(A);
    Store C0 into 'output0';

    my_accum extends Accumulator {
        public void accumulate() {
            // take a bag chunk
        }
        public void getValue() {
            // called after all bag chunks are processed
        }
    }

    Batch size

    pig.accumulative.batchsize=20000
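
A rough Python model of the accumulator pattern, assuming a hypothetical `MaxAccumulator` class and a `batches` helper that are not part of Pig. The method names follow the slide's accumulate/getValue pair, and the batch size of 20000 is taken from the setting above; everything else is illustrative.

```python
class MaxAccumulator:
    """Toy accumulator: tracks the maximum without holding the whole bag."""

    def __init__(self):
        self.current = None

    def accumulate(self, batch):
        # Called once per batch of the bag.
        for value in batch:
            if self.current is None or value > self.current:
                self.current = value

    def get_value(self):
        # Called after all batches have been processed.
        return self.current


def batches(values, batch_size):
    """Yield lists of at most `batch_size` items from an iterable."""
    batch = []
    for value in values:
        batch.append(value)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


acc = MaxAccumulator()
largest_batch = 0
for batch in batches(iter(range(100_000)), batch_size=20_000):
    largest_batch = max(largest_batch, len(batch))
    acc.accumulate(batch)
result = acc.get_value()
```

At no point are more than 20000 values held in memory, even though the bag has 100000 entries.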

    Memory optimization

    Control bag size on reduce side
        If bag size exceeds the threshold, spill to disk
        Control the bag size to fit the bag in memory if possible

    Mapreduce:

    reduce(Text key, Iterator values, )

    Diagram: Iterator feeding Bag of Input 1, Bag of Input 2, Bag of Input 3

    pig.cachedbag.memusage=0.2

    Optimization starts before Pig

    Input format
    Serialization format
    Compression

    Input format - Test Query

    > searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, );

    > search_thejas = filter searches by Query matches '.*thejas.*';

    > dump search_thejas;

    (1568578 , thejasminesupperclub, .)

    Input formats

    Chart: RunTime (sec) for PigStorage, LzoPigStorage, PigStorage with types, and AvroStorage (has types)

    Columnar format

    RCFile
        Columnar format for a group of rows
        More efficient if you query a subset of columns

    Tests with RCFile

    Tests with load + project + filter out all records.
    Using HCatalog, with compression, types

    Test 1
        Project 1 out of 5 columns
    Test 2
        Project all 5 columns

    RCFile test results

    Chart: Plain Text compared with RCFile

    Cost based optimizations

    Optimization decisions based on your query/data
    Often an iterative process

    Diagram: Run query, Measure, Tune

    Cost based optimization - Aggregation

    Hash Based Agg
    Use pig.exec.mapPartAgg=true to enable

    Diagram: Map task, Map (logic)

    Cost based optimization - Hash Agg.

    Auto off feature
        Switches off HBA if output reduction is not good enough

    Configuring Hash Agg
        Configure auto off feature - pig.exec.mapPartAgg.minReduction
        Configure memory used - pig.cachedbag.memusage
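
A simplified Python model of hash based aggregation with an auto off check. This is not Pig's implementation: the function, its `min_reduction` and `check_after` parameters, and the reading of "output reduction" as records seen divided by distinct keys held are all assumptions made for illustration.

```python
def map_side_partial_agg(records, min_reduction, check_after):
    """Sum (key, value) pairs in a hash map inside a hypothetical map task.

    After `check_after` records, compare records seen with distinct keys held.
    If that ratio is below `min_reduction`, flush the table and pass the
    remaining records through unaggregated.
    """
    table = {}
    output = []
    seen = 0
    enabled = True
    for key, value in records:
        seen += 1
        if enabled:
            table[key] = table.get(key, 0) + value
            if seen == check_after and seen / len(table) < min_reduction:
                enabled = False
                output.extend(table.items())
                table = {}
        else:
            output.append((key, value))
    output.extend(table.items())
    return output, enabled


# Few distinct keys: aggregation pays off and stays on.
few_keys = [(i % 4, 1) for i in range(1000)]
out_few, on_few = map_side_partial_agg(few_keys, min_reduction=10, check_after=100)

# Every key distinct: no reduction, so aggregation switches itself off.
many_keys = [(i, 1) for i in range(1000)]
out_many, on_many = map_side_partial_agg(many_keys, min_reduction=10, check_after=100)
```

With four keys the map task emits four records instead of a thousand; with all keys distinct the hash table only costs memory, which is the case the auto off feature is meant to catch.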

    Cost based optimization - Join

    Use appropriate join algorithm
        Skew on join key - Skew join
        Fits in memory - FR join
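
The idea behind a join where one side fits in memory can be sketched in Python as below. The `replicated_join` function and the sample page-view and user rows are hypothetical; the sketch only shows the general technique of holding the small input in a lookup table and streaming the large one past it, not Pig's own code.

```python
def replicated_join(big, small, key):
    """Join by loading `small` into a dict and streaming `big` past it."""
    lookup = {}
    for row in small:
        lookup.setdefault(row[key], []).append(row)
    for row in big:
        for match in lookup.get(row[key], []):
            yield {**row, **match}


# Hypothetical inputs: a large page-view stream and a small users table.
page_views = [
    {"uid": 1, "url": "a"},
    {"uid": 2, "url": "b"},
    {"uid": 1, "url": "c"},
    {"uid": 9, "url": "d"},
]
users = [{"uid": 1, "age": 25}, {"uid": 2, "age": 31}]

joined = list(replicated_join(page_views, users, "uid"))
```

The large input is read once and never sorted or shuffled, which is why this approach only works when the other side is small enough to hold in memory.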


    Parallelism of reduce tasks

    Chart: Runtime, with axis values 4.0, 6.0, 8.0, 24.0, 48.0, 256.0

    Number of reduce slots = 6
    Factors affecting runtime
        Cores simultaneously used/skew
        Cost of having additional reduce tasks

    Cost based optimization - keep data sorted

    Frequent join operations on the same keys
        Keep data sorted on keys
        Use merge join
        Optimized group on sorted keys
        Works with few load functions - needs additional i/f implementation
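
Why sorted inputs make joins cheap can be seen in a short Python sketch of a generic merge join. The `merge_join` function and its sample inputs are hypothetical and assume both sides are lists of (key, value) pairs already sorted by key; it illustrates the technique in general, not Pig's merge join code.

```python
def merge_join(left, right):
    """Inner join of two (key, value) lists that are already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for r in right[j:j_end]:
                    out.append((lk, left[i][1], r[1]))
                i += 1
            j = j_end
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
result = merge_join(left, right)
```

Each input is walked once from start to end, so no sort or hash table is needed at join time as long as the data is kept sorted on the join key.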

    Optimizations for sorted data

    Diagram: Sort1, Sort2, Join 1, Join 2

    Future Directions

    Optimize using stats
        Using historical stats with HCatalog
        Sampling

    Questions

    ?
