making pig fly
-
8/10/2019 Making Pig Fly
Hortonworks Inc. 2011
Daniel Dai (@daijy), Thejas Nair (@thejasn)
Making Pig Fly: Optimizing Data Processing on Hadoop
-
What is Apache Pig?
Architecting the Future of Big Data
Pig Latin, a high-level data processing language.
An engine that executes Pig Latin locally or on a Hadoop cluster.
Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
-
Pig-latin example
Query: Get the list of web pages visited by users whose age is between 20 and 29 years.
    USERS = load 'users' as (uid, age);
    USERS_20s = filter USERS by age >= 20 and age <= 29;
-
Why Pig?
Faster development
  Fewer lines of code
  Don't re-invent the wheel
Flexible
  Metadata is optional
  Extensible
  Procedural programming
Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
-
Pig optimizations
Ideally, the user should not have to bother
Reality
  Pig is still young and immature
  Pig does not have the whole picture
    Cluster configuration
    Data histogram
  Pig philosophy: Pig is docile
-
Pig optimizations
What Pig does for you
  Does safe transformations of the query to optimize it
  Optimized operations (join, sort)
What you do
  Organize input in an optimal way
  Optimize the Pig Latin query
  Tell Pig what join/group algorithm to use
-
Rule based optimizer
Column pruner
Push up filter
Push down flatten
Push up limit
Partition pruning
Global optimizer
-
Column Pruner
Pig will do column pruning automatically
    A = load 'input' as (a0, a1, a2);
    B = foreach A generate a0+a1;
    C = order B by $0;
    store C into 'output';
  Pig will prune a2 automatically
Cases where Pig will not do column pruning automatically
  No schema specified in the load statement
    A = load 'input';
    B = order A by $0;
    C = foreach B generate $0+$1;
    store C into 'output';
  DIY
    A = load 'input';
    A1 = foreach A generate $0, $1;
    B = order A1 by $0;
    C = foreach B generate $0+$1;
    store C into 'output';
-
Column Pruner
Another case where Pig does not do column pruning
  Pig does not keep track of unused columns after grouping
    A = load 'input' as (a0, a1, a2);
    B = group A by a0;
    C = foreach B generate SUM(A.a1);
    store C into 'output';
  DIY
    A = load 'input' as (a0, a1, a2);
    A1 = foreach A generate $0, $1;
    B = group A1 by a0;
    C = foreach B generate SUM(A1.a1);
    store C into 'output';
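The DIY step above amounts to projecting away unused columns before the expensive operator. The following is a minimal Python sketch of that idea, not Pig code: the rows, the `project` and `group_sum` helpers, and the payload column are all made up for illustration.

```python
from collections import defaultdict

def project(rows, keep):
    """Keep only the listed column positions (the 'foreach A generate $0, $1' step)."""
    return [tuple(row[i] for i in keep) for row in rows]

def group_sum(rows, key_col, value_col):
    """Group rows by one column and sum another, like 'group ... by' plus SUM."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_col]] += row[value_col]
    return dict(totals)

# Hypothetical input with schema (a0, a1, a2); a2 is never used downstream.
rows = [("x", 1, "unused payload"), ("y", 2, "unused payload"), ("x", 3, "unused payload")]

pruned = project(rows, keep=(0, 1))
result = group_sum(pruned, key_col=0, value_col=1)

# Number of fields that have to be carried into the grouping step.
fields_before = sum(len(r) for r in rows)
fields_after = sum(len(r) for r in pruned)
print(result, fields_before, fields_after)
```

The grouped result is the same either way; only the amount of data carried through the group changes.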
-
Push up filter
Pig splits the filter condition before pushing it up
[Diagram, three panels:
  Original query: A and B feed a Join, followed by Filter a0>0 && b0>10
  Split filter condition: A and B feed a Join, followed by Filter a0>0 and Filter b0>10
  Push up filter: Filter a0>0 is applied to A and Filter b0>10 to B, ahead of the Join]
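A small Python sketch of why this rewrite is safe and why it helps, assuming an inner join and a condition whose two halves each touch only one input. The data, field names, and `join` helper are hypothetical and do not reflect Pig internals.

```python
def join(left, right, left_key, right_key):
    """Naive inner join of two lists of dicts on the given key fields."""
    return [{**l, **r} for l in left for r in right if l[left_key] == r[right_key]]

# Hypothetical inputs: A has fields (k, a0), B has fields (kb, b0).
A = [{"k": 1, "a0": 5}, {"k": 2, "a0": -1}, {"k": 3, "a0": 7}]
B = [{"kb": 1, "b0": 20}, {"kb": 2, "b0": 30}, {"kb": 3, "b0": 4}]

# Original plan: join first, then apply the combined condition.
original = [row for row in join(A, B, "k", "kb") if row["a0"] > 0 and row["b0"] > 10]

# Rewritten plan: the condition is split, and each half filters its own input first.
A_filtered = [row for row in A if row["a0"] > 0]
B_filtered = [row for row in B if row["b0"] > 10]
pushed_up = join(A_filtered, B_filtered, "k", "kb")

# Same answer, but the join sees fewer candidate pairs.
pairs_original = len(A) * len(B)
pairs_pushed_up = len(A_filtered) * len(B_filtered)
print(original == pushed_up, pairs_original, pairs_pushed_up)
```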
-
Other push up/down
Push down flatten
Push up limit
[Diagram: a Load, Flatten, Order plan, shown twice]
-
Partition pruning
Prune unnecessary partitions entirely
  HCatLoader
[Diagram: partitions 2010, 2011, 2012. Before: HCatLoader followed by Filter (year>=2011). After: HCatLoader (year>=2011)]
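The effect of pushing the year predicate into the loader can be sketched in Python as follows. This is an illustration only: the in-memory `partitions` dictionary stands in for partitioned storage, and both loader functions are hypothetical, not HCatLoader's API.

```python
# Hypothetical partitioned table: one list of records per 'year' partition.
partitions = {
    2010: [{"year": 2010, "v": 1}, {"year": 2010, "v": 2}],
    2011: [{"year": 2011, "v": 3}],
    2012: [{"year": 2012, "v": 4}, {"year": 2012, "v": 5}],
}

def load_then_filter(min_year):
    """Without pruning: read every partition, then filter record by record."""
    scanned = [rec for recs in partitions.values() for rec in recs]
    return [rec for rec in scanned if rec["year"] >= min_year], len(scanned)

def load_with_pruning(min_year):
    """With pruning: skip whole partitions whose key fails the predicate."""
    scanned = [rec for year, recs in partitions.items() if year >= min_year for rec in recs]
    return scanned, len(scanned)

plain_rows, plain_scanned = load_then_filter(2011)
pruned_rows, pruned_scanned = load_with_pruning(2011)
print(plain_rows == pruned_rows, plain_scanned, pruned_scanned)
```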
-
Intermediate file compression
[Diagram: a Pig script runs as a chain of jobs (map 1 / reduce 1, map 2 / reduce 2, map 3 / reduce 3), with a Pig temp file written between jobs]
-
Enable temp file compression
Pig temp files are not compressed by default
  Issues with Snappy (HADOOP-7990)
  LZO: not Apache license
Enable LZO compression
  Install LZO for Hadoop
  In conf/pig.properties
    pig.tmpfilecompression = true
    pig.tmpfilecompression.codec = lzo
With LZO, disk savings of up to more than 90% and a query speedup of up to 4x
-
Multiquery
Combine two or more map/reduce jobs into one
  Happens automatically
  Cases where we want to control multiquery: it combines too many
[Diagram: a single Load feeds three branches (Group by $0, Group by $1, Group by $2), each followed by its own Foreach and Store]
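The idea in the diagram, reading the input once and feeding several group-by branches from that single scan, can be sketched in Python as below. This is only an analogy for the combined job; the function name and rows are hypothetical.

```python
from collections import defaultdict

def multi_group_count(rows, key_columns):
    """Count rows per key for several grouping columns in one pass over the input."""
    counts = {col: defaultdict(int) for col in key_columns}
    for row in rows:                 # the input is read once
        for col in key_columns:      # every branch sees each row
            counts[col][row[col]] += 1
    return {col: dict(c) for col, c in counts.items()}

rows = [("a", "x", 1), ("a", "y", 1), ("b", "x", 2)]
outputs = multi_group_count(rows, key_columns=(0, 1, 2))
print(outputs)
```

The trade-off the slide hints at is visible here as well: every extra branch adds state that has to be held at the same time, which is one reason to limit how much gets combined.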
-
Implement the right UDF
Algebraic UDF
  Initial
  Intermediate
  Final
    A = load 'input';
    B0 = group A by $0;
    C0 = foreach B0 generate group, SUM(A);
    store C0 into 'output0';
Where each part runs
  Map: Initial
  Combiner: Intermediate
  Reduce: Final
Implement the right UDF
Accumulator UDF
  Reduce-side UDF
  Normally takes a bag
Benefit
  Big bags are passed in batches
  Avoids using too much memory
Batch size
    pig.accumulative.batchsize=20000
Example
    A = load 'input';
    B0 = group A by $0;
    C0 = foreach B0 generate group, my_accum(A);
    store C0 into 'output0';

    my_accum implements Accumulator {
        public void accumulate() {
            // take a bag chunk
        }
        public void getValue() {
            // called after all bag chunks are processed
        }
    }
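The accumulate / getValue contract can be sketched in Python as follows. This is an illustration of the batching pattern rather than the Pig interface: `MaxAccumulator`, `chunks`, and the batch size of 4 are hypothetical.

```python
class MaxAccumulator:
    """Hypothetical accumulator-style function that tracks the largest value seen."""

    def __init__(self):
        self.best = None

    def accumulate(self, chunk):
        """Called once per chunk of the bag."""
        for value in chunk:
            if self.best is None or value > self.best:
                self.best = value

    def get_value(self):
        """Called after all chunks have been processed."""
        return self.best

def chunks(values, batch_size):
    """Yield the bag in batches, so only one batch is held in memory at a time."""
    batch = []
    for value in values:
        batch.append(value)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

acc = MaxAccumulator()
batches_seen = 0
for chunk in chunks(iter(range(10)), batch_size=4):
    acc.accumulate(chunk)
    batches_seen += 1
print(acc.get_value(), batches_seen)
```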
-
Memory optimization
Control bag size on reduce side
  If the bag size exceeds the threshold, spill to disk
  Control the bag size to fit the bag in memory if possible
    pig.cachedbag.memusage=0.2
MapReduce:
    reduce(Text key, Iterator values, ...)
[Diagram: the values Iterator is split into Bag of Input 1, Bag of Input 2, Bag of Input 3]
-
Optimization starts before pig
Input format
Serialization format
Compression
-
Input format - Test Query
    > searches = load 'aol_search_logs.txt' using PigStorage() as (ID, Query, ...);
    > search_thejas = filter searches by Query matches '.*thejas.*';
    > dump search_thejas;
    (1568578 , thejasminesupperclub, ...)
-
Input formats
[Chart: run time (sec) for PigStorage, LzoPigStorage, PigStorage with types, and AvroStorage (has types)]
-
Columnar format
RCFile
  Columnar format for a group of rows
  More efficient if you query a subset of columns
-
Tests with RCFile
Tests with load + project + filter out all records
  Using HCatalog, with compression and types
Test 1
  Project 1 out of 5 columns
Test 2
  Project all 5 columns
-
RCFile test results
[Chart: test results for Plain Text and RCFile]
-
Cost based optimizations
Optimization decisions based on your query/data
Often an iterative process
  Run query
  Measure
  Tune
-
Cost based optimization - Aggregation
Hash Based Agg
  Use pig.exec.mapPartAgg=true to enable
[Diagram: a Map task, with the Map (logic) step inside it]
-
Cost based optimization - Hash Agg.
Auto off feature
  Switches off HBA if the output reduction is not good enough
Configuring Hash Agg
  Configure auto off feature - pig.exec.mapPartAgg.minReduction
  Configure memory used - pig.cachedbag.memusage
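The auto off behaviour can be illustrated with a Python sketch of map-side hash aggregation. This is a simplified model and not Pig's implementation: the `sample_size` parameter, the single check point, and the threshold values used below are assumptions made for the example.

```python
def map_side_aggregate(records, min_reduction, sample_size):
    """Sum values per key in an in-memory hash map on the map side.

    After sample_size records, compare records seen with distinct keys held.
    If that ratio is below min_reduction, stop aggregating and pass records through.
    """
    table, output, enabled = {}, [], True
    for seen, (key, value) in enumerate(records, start=1):
        if enabled:
            table[key] = table.get(key, 0) + value
            if seen == sample_size and seen / len(table) < min_reduction:
                enabled = False                 # auto off
                output.extend(table.items())    # flush what was aggregated so far
                table = {}
        else:
            output.append((key, value))
    output.extend(table.items())
    return output, enabled

# Few distinct keys: aggregation pays off and stays on.
repetitive_out, repetitive_on = map_side_aggregate([("k", 1)] * 100, min_reduction=10, sample_size=10)

# Every key distinct: no reduction, so aggregation switches itself off.
unique_out, unique_on = map_side_aggregate([(i, 1) for i in range(100)], min_reduction=10, sample_size=10)
print(len(repetitive_out), repetitive_on, len(unique_out), unique_on)
```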
-
Cost based optimization - Join
Use the appropriate join algorithm
  Skew on join key - Skew join
  Fits in memory - FR (fragment-replicate) join
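The "fits in memory" case can be sketched in Python: the small input is loaded into a hash table and the large input is streamed past it, so no shuffle-style regrouping of the large input is needed. The data and the `replicated_join` helper are hypothetical, and this is an analogy for the FR join rather than Pig code.

```python
from collections import defaultdict

def replicated_join(big, small, big_key, small_key):
    """Join by loading the small input into a hash table and streaming the big one."""
    lookup = defaultdict(list)
    for row in small:
        lookup[row[small_key]].append(row)
    for row in big:                      # the big input is never held in memory
        for match in lookup.get(row[big_key], []):
            yield row + match

# Hypothetical inputs: many page views, few users.
page_views = [("u1", "/home"), ("u2", "/cart"), ("u1", "/help"), ("u9", "/home")]
users = [("u1", 25), ("u2", 31)]

joined = list(replicated_join(page_views, users, big_key=0, small_key=0))
print(joined)
```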
-
Parallelism of reduce tasks
[Chart: Runtime, with axis values 4, 6, 8, 24, 48, 256]
Number of reduce slots = 6
Factors affecting runtime
  Cores simultaneously used/skew
  Cost of having additional reduce tasks
-
Cost based optimization - keep data sorted
Frequent join operations on the same keys
  Keep data sorted on keys
  Use merge join
Optimized group on sorted keys
  Works with few load functions - needs an additional interface implementation
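Why sorted inputs help can be seen in a Python sketch of a merge join: both inputs are walked once, in key order, with no hash table and no re-sort. The `merge_join` function and sample data are hypothetical and only illustrate the technique.

```python
def merge_join(left, right):
    """Inner join of two lists of (key, value) pairs that are already sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with each left row.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[r][1]) for r in range(j, j_end))
                i += 1
            j = j_end
    return out

left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
result = merge_join(left, right)
print(result)
```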
-
Optimizations for sorted data
[Diagram with Join 1, Join 2, Sort1, and Sort2 stages]
-
Future Directions
Optimize using stats
  Using historical stats with HCatalog
  Sampling
-
Questions
?