how to make your map-reduce jobs perform as well as pig: lessons from pig optimizations
![Page 1: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/1.jpg)
How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations
http://pig.apache.org
Thejas Nair
pig team @ Yahoo!
Apache pig PMC member
![Page 2: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/2.jpg)
What is Pig?
Pig Latin, a high level data processing language.
An engine that executes Pig Latin locally or on a Hadoop cluster.
Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
![Page 3: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/3.jpg)
Pig Latin example
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
![Page 4: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/4.jpg)
Comparison with MR in Java
[Bar chart: lines of code, Hadoop vs. Pig] 1/20 the lines of code
[Bar chart: development time in minutes, Hadoop vs. Pig] 1/16 the development time
What about performance?
![Page 5: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/5.jpg)
Pig Compared to Map Reduce
• Faster development time
• Data flow versus programming logic
• Many standard data operations (e.g. join) included
• Manages all the details of connecting jobs and data flow
• Copes with Hadoop version change issues
![Page 6: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/6.jpg)
And, You Don’t Lose Power
• UDFs can be used to load, evaluate, aggregate, and store data
• External binaries can be invoked
• Metadata is optional
• Flexible data model
• Nested data types
• Explicit data flow programming
![Page 7: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/7.jpg)
Pig performance
• PigMix: Pig vs. MapReduce
![Page 8: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/8.jpg)
Pig optimization principles
• Unlike an RDBMS, Pig has no accurate models for data, operators and the execution environment
• Use available reliable info. Trust user choice.
• Use rules that help in most cases
• Rules based on runtime information
![Page 9: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/9.jpg)
Logical Optimizations
Restructure given logical dataflow graph
• Apply filter, project, limit early
• Merge foreach, filter statements
• Operator rewrites
Script: A = load ...; B = foreach ...; C = filter ...
-> Parser -> Logical plan: A -> B -> C
-> Logical Optimizer -> Optimized logical plan: A -> C -> B
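The A -> B -> C to A -> C -> B rewrite above can be imitated in a hand-written job by filtering before doing per-record work. The following is only a sketch in plain Python, not Pig's optimizer: the plan representation (`op`, `fn`, `reads`, `writes`) and the function names are hypothetical, and it assumes the foreach step only adds fields, so a filter that does not read those fields can safely run first.

```python
def run_plan(rows, plan):
    """Apply each step in order; also count rows the foreach steps had to process."""
    foreach_rows = 0
    for step in plan:
        if step["op"] == "foreach":
            foreach_rows += len(rows)
            rows = [step["fn"](r) for r in rows]
        elif step["op"] == "filter":
            rows = [r for r in rows if step["fn"](r)]
    return rows, foreach_rows


def push_filters_early(plan):
    """Swap a filter ahead of a preceding foreach when it reads none of the fields
    that the foreach writes (hypothetical rule, for illustration only)."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            first, second = plan[i], plan[i + 1]
            if (first["op"] == "foreach" and second["op"] == "filter"
                    and not (second["reads"] & first["writes"])):
                plan[i], plan[i + 1] = second, first
                changed = True
    return plan


users = [{"name": "fred", "age": 20}, {"name": "jane", "age": 40}, {"name": "amy", "age": 17}]
plan = [
    {"op": "foreach", "fn": lambda r: {**r, "shout": r["name"].upper()}, "writes": {"shout"}},
    {"op": "filter", "fn": lambda r: 18 <= r["age"] <= 25, "reads": {"age"}},
]
original = run_plan(users, plan)
optimized = run_plan(users, push_filters_early(plan))
```

Both plans return the same rows; the reordered one simply runs the foreach on fewer of them.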
![Page 10: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/10.jpg)
Physical Optimizations
Physical plan: sequence of MR jobs having physical operators.
• Built-in rules, e.g. use of combiner
• Specified in query, e.g. join type
Optimized logical plan: X -> Y -> Z
-> Translator -> Physical/MR plan: M(PX-PYm) R(PYr) -> M(Z)
-> Optimizer -> Optimized physical/MR plan: M(PX-PYm) C(PYc) R(PYr) -> M(Z)
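The C(PYc) stage added by the optimizer is a combiner: a partial aggregation run on each map's output before the shuffle. A minimal sketch of the effect, simulated in plain Python rather than Hadoop (all function names here are made up for the example), counts how many records would cross the network with and without it.

```python
from collections import Counter, defaultdict


def map_phase(split):
    # Emit (key, 1) for every input record, as a counting job would.
    return [(word, 1) for word in split]


def combine(pairs):
    # Partial sum per key within one map task's output.
    partial = Counter()
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_phase(shuffled):
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals)


def count(splits, use_combiner):
    """Return (final counts, number of records sent through the shuffle)."""
    shuffled = []
    for split in splits:
        pairs = map_phase(split)
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    return reduce_phase(shuffled), len(shuffled)


splits = [["a", "a", "b"], ["a", "b", "b", "b"]]
without_combiner = count(splits, use_combiner=False)
with_combiner = count(splits, use_combiner=True)
```

This only works for aggregates that can be computed from partial results (counts, sums, min/max), which is why it is expressed as a built-in rule rather than applied to every job.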
![Page 11: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/11.jpg)
Hash Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Users by name, Pages by user;
[Diagram: Map 1 reads Pages block n and emits records tagged (1, user); Map 2 reads Users block m and emits records tagged (2, name). Reducer 1 receives (1, fred) (2, fred) (2, fred); Reducer 2 receives (1, jane) (2, jane) (2, jane).]
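The diagram above describes a reduce-side join: each map tags its records with the input they came from, the shuffle brings all records for a key to one reducer, and the reducer pairs them up. Here is a small simulation in plain Python under those assumptions; the function names and the use of Python's `hash` as the partitioner are illustrative, not Pig's or Hadoop's actual code.

```python
from collections import defaultdict


def map_tagged(tag, records, key_index):
    # Emit (join key, (input tag, record)) so the reducer can tell the inputs apart.
    return [(rec[key_index], (tag, rec)) for rec in records]


def shuffle(pairs, num_reducers):
    # Hash-partition by key: every record with the same key lands on the same reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_join(grouped):
    out = []
    for key, values in grouped.items():
        left = [rec for tag, rec in values if tag == 1]
        right = [rec for tag, rec in values if tag == 2]
        # Cross product of the two sides for this key.
        out.extend(l + r for l in left for r in right)
    return out


def hash_join(left, left_key, right, right_key, num_reducers=2):
    pairs = map_tagged(1, left, left_key) + map_tagged(2, right, right_key)
    joined = []
    for partition in shuffle(pairs, num_reducers):
        joined.extend(reduce_join(partition))
    return joined


users = [("fred", 20), ("jane", 22)]
pages = [("fred", "a.com"), ("fred", "b.com"), ("jane", "c.com"), ("bob", "d.com")]
joined = sorted(hash_join(users, 0, pages, 0))
```

Because one reducer handles all records for a key, a single very frequent key can overload one reducer.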
![Page 12: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/12.jpg)
Skew Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'skewed';
[Diagram: Map 1 reads Pages block n and emits records tagged (1, user); Map 2 reads Users block m and emits records tagged (2, name); an SP box follows each map. Reducer 1 receives (1, fred, p1) (1, fred, p2) (2, fred); Reducer 2 receives (1, fred, p3) (1, fred, p4) (2, fred).]
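In the diagram, the Pages records for the frequent key fred are spread over both reducers, and the matching Users record is sent to each of them. A rough sketch of that idea in plain Python follows. The names, the `hot_threshold` parameter and the round-robin placement are assumptions made for the example; a real job would estimate key frequencies from a sample instead of counting every record.

```python
from collections import Counter, defaultdict


def skew_join(big, big_key, small, small_key, num_reducers=2, hot_threshold=3):
    """Return (joined rows, number of big-side records handled by each reducer)."""
    # Assumption: exact counts stand in for a sampling pass over the big input.
    counts = Counter(rec[big_key] for rec in big)
    hot = {key for key, n in counts.items() if n >= hot_threshold}

    reducers = [{"big": defaultdict(list), "small": defaultdict(list)}
                for _ in range(num_reducers)]
    next_slot = Counter()
    for rec in big:
        key = rec[big_key]
        if key in hot:
            # Spread a hot key's records over all reducers, round-robin.
            target = next_slot[key] % num_reducers
            next_slot[key] += 1
        else:
            target = hash(key) % num_reducers
        reducers[target]["big"][key].append(rec)

    for rec in small:
        key = rec[small_key]
        # Replicate the other side of a hot key to every reducer that may need it.
        targets = range(num_reducers) if key in hot else [hash(key) % num_reducers]
        for t in targets:
            reducers[t]["small"][key].append(rec)

    joined, load = [], []
    for r in reducers:
        for key, big_recs in r["big"].items():
            joined.extend(b + s for b in big_recs for s in r["small"].get(key, []))
        load.append(sum(len(recs) for recs in r["big"].values()))
    return joined, load


pages = [("fred", "p1"), ("fred", "p2"), ("fred", "p3"), ("fred", "p4")]
users = [("fred", 20)]
joined, load = skew_join(pages, 0, users, 0)
```

The cost is the extra copies of the smaller side; the gain is that no single reducer has to process every record of the hot key.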
![Page 13: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/13.jpg)
Merge Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'merge';
[Diagram: Pages (aaron ... zach) and Users (aaron ... zach) are both sorted on the join key. Map 1 reads Pages aaron…amr and Users starting at aaron…; Map 2 reads Pages amy…barb and Users starting at amy….]
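When both inputs are already sorted on the join key, each map task can join its slice with a single forward pass and no shuffle. The sketch below shows the core merge step in plain Python; it assumes both lists are sorted by key and leaves out how a map task seeks to the right starting position in the second input.

```python
def merge_join(left, left_key, right, right_key):
    """Join two lists of tuples that are both sorted on their join key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side, then pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][left_key] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][right_key] == lk:
                j_end += 1
            out.extend(l + r for l in left[i:i_end] for r in right[j:j_end])
            i, j = i_end, j_end
    return out


pages = [("aaron", "p1"), ("amy", "p2"), ("amy", "p3"), ("zach", "p4")]
users = [("aaron", 30), ("amy", 25), ("barb", 40)]
joined = merge_join(pages, 0, users, 0)
```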
![Page 14: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/14.jpg)
Replicated Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'replicated';
[Diagram: Pages (aaron ... zach) is the large input; Users (aaron . zach) is small. Map 1 reads Pages aaron…amr plus a full copy of Users (aaron . zach); Map 2 reads Pages amy…barb plus a full copy of Users (aaron . zach).]
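As the diagram suggests, a replicated join needs no reduce phase: every map task loads the whole small input into memory and probes it for each record of its split of the large input. A minimal plain-Python sketch, assuming the small input fits in memory (function names are made up for the example):

```python
from collections import defaultdict


def build_lookup(small, key_index):
    # Done once per map task: index the small input by join key.
    table = defaultdict(list)
    for rec in small:
        table[rec[key_index]].append(rec)
    return table


def map_replicated_join(big_split, big_key, lookup):
    # Probe the in-memory table for each record of this task's split.
    return [rec + match
            for rec in big_split
            for match in lookup.get(rec[big_key], [])]


users = [("aaron", 30), ("zach", 50)]
lookup = build_lookup(users, 0)
map1_out = map_replicated_join([("aaron", "p1"), ("amr", "p2")], 0, lookup)
map2_out = map_replicated_join([("amy", "p3"), ("zach", "p4")], 0, lookup)
```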
![Page 15: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/15.jpg)
Group/cogroup optimizations
• On sorted and 'collected' data
grp = group Users by name using 'collected';
[Diagram: input (labeled Pages on the slide) sorted by name: aaron, aaron, barney, carol, ..., zach. Map 1 reads aaron, aaron, barney; Map 2 reads carol, ....]
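If all records for a key are adjacent and never straddle two splits, each map task can form the groups itself and the reduce phase can be skipped. A minimal sketch in plain Python under exactly that assumption (the function name is hypothetical):

```python
from itertools import groupby
from operator import itemgetter


def map_collected_group(sorted_split, key_index=0):
    """Group adjacent records by key inside a single map task.

    Assumes every record for a key is contiguous and inside this one split.
    """
    return [(key, list(group))
            for key, group in groupby(sorted_split, key=itemgetter(key_index))]


split_1 = [("aaron", 1), ("aaron", 2), ("barney", 3)]
groups = map_collected_group(split_1)
```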
![Page 16: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/16.jpg)
Multi-store script
A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by age, gender;
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';
[Diagram: A: load -> B: filter, which then feeds two branches. Branch 1: C1: group -> eval udf -> store into 'bydemo'. Branch 2: C2: group -> eval udf -> store into 'bystate'.]
![Page 17: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/17.jpg)
Multi-Store Map-Reduce Plan
map: filter -> split -> local rearrange (one per branch)
reduce: multiplex -> package (one per branch) -> foreach (one per branch)
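The plan above reads and filters the input once, then carries both groupings through a single map-reduce job by tagging each record with the branch it belongs to. The plain-Python sketch below imitates that for the two COUNT queries of the multi-store script; the tagging scheme and function name are assumptions for illustration, not Pig's internal operators.

```python
from collections import Counter


def multi_store_counts(users):
    """One pass over the input that produces both the by-demographic and by-state counts."""
    # Map side: filter once, then "split" each surviving record into one
    # tagged key per grouping branch (0 = (age, gender), 1 = state).
    tagged = []
    for name, age, gender, city, state in users:
        if name is None:
            continue
        tagged.append((0, (age, gender)))
        tagged.append((1, state))

    # Reduce side: "multiplex" on the branch tag and aggregate each branch separately.
    outputs = [Counter(), Counter()]
    for branch, key in tagged:
        outputs[branch][key] += 1
    return dict(outputs[0]), dict(outputs[1])


users = [
    ("fred", 20, "m", "SF", "CA"),
    (None, 30, "f", "LA", "CA"),
    ("jane", 20, "f", "NY", "NY"),
    ("amy", 20, "f", "SJ", "CA"),
]
bydemo, bystate = multi_store_counts(users)
```

Written as two separate jobs, the same work would scan and filter the input twice.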
![Page 18: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/18.jpg)
Memory Management
Use disk if large objects don’t fit into memory
• JVM memory limit larger than physical memory: very poor performance
• Spilling on memory threshold notifications from the JVM: unreliable
• Pre-set size limit for large bags, with custom spill logic for different bag types, e.g. distinct bag
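The last bullet, a pre-set limit after which a bag spills to disk, can be sketched as follows. This is a simplified illustration in plain Python, not Pig's bag implementation: the class name, the item-count limit (`max_in_memory`) and the use of `pickle` with temporary files are all assumptions made for the example.

```python
import pickle
import tempfile


class SpillableBag:
    """Holds items in memory up to a fixed count, then spills them to a temp file."""

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self.memory = []
        self.spill_files = []  # list of (file object, number of items in it)

    def add(self, item):
        self.memory.append(item)
        # Spill on a pre-set limit instead of waiting for a JVM-style memory warning.
        if len(self.memory) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        f = tempfile.TemporaryFile()
        for item in self.memory:
            pickle.dump(item, f)
        self.spill_files.append((f, len(self.memory)))
        self.memory = []

    def __iter__(self):
        # Replay spilled items first (in insertion order), then the in-memory tail.
        for f, count in self.spill_files:
            f.seek(0)
            for _ in range(count):
                yield pickle.load(f)
        yield from self.memory


bag = SpillableBag(max_in_memory=3)
for n in range(8):
    bag.add(n)
contents = list(bag)
```

A bag with different semantics, such as a distinct bag, would need its own spill logic, for example sorting each spill and merging on read to drop duplicates.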
![Page 19: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/19.jpg)
Other optimizations
• Aggressive use of combiner, secondary sort
• Lazy deserialization in loaders
• Better serialization format
• Faster regex lib, compiled pattern
• Compression between MR jobs
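As one illustration from this list, lazy deserialization means a loader keeps the raw field bytes and converts a field to a typed value only when something actually asks for it. A small plain-Python sketch of the idea, with a hypothetical `LazyRecord` class and tab-delimited text input assumed:

```python
class LazyRecord:
    """Keeps raw text fields and converts a field to int only on first access."""

    def __init__(self, line, delimiter="\t"):
        self._raw = line.rstrip("\n").split(delimiter)
        self._parsed = {}
        self.parse_count = 0  # for illustration: how many conversions actually ran

    def get_str(self, index):
        return self._raw[index]

    def get_int(self, index):
        if index not in self._parsed:
            self._parsed[index] = int(self._raw[index])
            self.parse_count += 1
        return self._parsed[index]


rec = LazyRecord("fred\t20\t94301\n")
age_first = rec.get_int(1)
age_again = rec.get_int(1)
```

Fields that a query never touches, such as the third column here, are never converted at all.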
![Page 20: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/20.jpg)
Future optimization work
• Improve memory management
• Join + group in single MR, if same keys used
• Even better skew handling
• Adaptive optimizations
• Automated hadoop tuning
• …
![Page 21: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/21.jpg)
Pig - fast and flexible
More flexibility in 0.8, 0.9
• UDFs in scripting languages (Python)
• MR job as relation
• Relation as scalar
• Turing complete Pig (0.9)
Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
![Page 22: How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations](https://reader036.vdocument.in/reader036/viewer/2022062314/56812c77550346895d9119f2/html5/thumbnails/22.jpg)
Further reading
• Docs - http://pig.apache.org/docs/r0.7.0/
• Papers and talks - http://wiki.apache.org/pig/PigTalksPapers
• Training videos on vimeo.com (search 'hadoop pig')