Provenance for Generalized Map and Reduce Workflows
![Page 1: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/1.jpg)
Provenance for Generalized Map and Reduce Workflows
Robert Ikeda, Hyunjung Park, Jennifer Widom
Stanford University
Pei Zhang, Yue Lu
![Page 2: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/2.jpg)
Robert Ikeda
Provenance
Where data came from
How it was derived, manipulated, combined, processed, …
How it has evolved over time
Uses: explanation, debugging and verification, recomputation
![Page 3: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/3.jpg)
The Panda Environment
Data-oriented workflows: graph of processing nodes; data sets on edges
Statically defined; batch execution; acyclic
[Figure: workflow with input data sets I1, …, In and output data set O]
![Page 4: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/4.jpg)
Provenance
Backward tracing: find the input subsets that contributed to a given output element
Forward tracing: determine which output elements were derived from a particular input element
[Figure: example data sets TwitterPosts and MovieSentiments]
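The two tracing directions can be illustrated with a minimal Python sketch. It assumes provenance for a single node is available as (input ID, output ID) pairs. The IDs and the names `provenance_pairs`, `trace_backward`, and `trace_forward` are hypothetical and do not come from the talk.

```python
from collections import defaultdict

# Hypothetical provenance for one processing node:
# each pair says "this input element contributed to this output element".
provenance_pairs = [
    ("tweet1", "sent1"),
    ("tweet2", "sent1"),
    ("tweet3", "sent2"),
]

# Index the pairs in both directions so each trace is a dictionary lookup.
backward_index = defaultdict(set)
forward_index = defaultdict(set)
for input_id, output_id in provenance_pairs:
    backward_index[output_id].add(input_id)
    forward_index[input_id].add(output_id)


def trace_backward(output_id):
    """Return the input elements that contributed to the given output element."""
    return set(backward_index.get(output_id, set()))


def trace_forward(input_id):
    """Return the output elements derived from the given input element."""
    return set(forward_index.get(input_id, set()))


print(trace_backward("sent1"))
print(trace_forward("tweet3"))
```

Indexing in both directions is only one possible design; it trades extra space for constant-time lookups in either direction.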
![Page 5: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/5.jpg)
Provenance
Basic idea:
Capture provenance one node at a time (lazy or eager)
Use it for backward and forward tracing
Handle processing nodes of all types
![Page 6: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/6.jpg)
Generalized Map and Reduce Workflows
What if every node were a Map or Reduce function?
Provenance is easier to define, capture, and exploit than in the general case
Transparent provenance capture in Hadoop: doesn’t interfere with parallelism or fault tolerance
[Figure: workflow composed of Map (M) and Reduce (R) nodes]
![Page 7: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/7.jpg)
Remainder of Talk
Defining Map and Reduce provenance
Recursive workflow provenance
Capturing and tracing provenance
System description and performance
Future work
![Page 8: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/8.jpg)
Remainder of Talk
Defining Map and Reduce provenance
Recursive workflow provenance
Capturing and tracing provenance
System description and performance
Future work
Surprising theoretical result
Implementation details
![Page 9: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/9.jpg)
Example
[Figure: example workflow with inputs Tweets and Diggs; processing nodes TweetScan, DiggScan, Aggregate, and Filter; outputs GoodMovies and BadMovies; edge labels TM, DM, AM]
![Page 10: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/10.jpg)
Transformation Properties
Deterministic Functions.
Multiplicity for Map Functions
Multiplicity for Reduce Functions
Monotonicity
![Page 11: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/11.jpg)
Map and Reduce Provenance
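Map functions: $M(I) = \bigcup_{i \in I} M(\{i\})$
Provenance of $o \in O$ is $i \in I$ such that $o \in M(\{i\})$
Reduce functions: $R(I) = \bigcup_{1 \le k \le n} R(I_k)$, where $I_1, \dots, I_n$ partition $I$ on reduce-key
Provenance of $o \in O$ is $I_k \subseteq I$ such that $o \in R(I_k)$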
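The two definitions translate almost directly into code. The sketch below is illustrative only and computes provenance by brute force from the definitions, not the way a real system would capture it. The helper names and the word-count example functions are assumptions, not part of the talk.

```python
def map_provenance(map_fn, inputs, o):
    """Return every input element i such that o is in M({i})."""
    return [i for i in inputs if o in map_fn([i])]


def reduce_provenance(reduce_group, inputs, reduce_key, o):
    """Return every group I_k (partition of inputs on reduce-key) with o in R(I_k)."""
    groups = {}
    for i in inputs:
        groups.setdefault(reduce_key(i), []).append(i)
    return [group for group in groups.values() if o in reduce_group(group)]


# Hypothetical example functions (assumptions, chosen for illustration).
def split_words(lines):
    """Map function: one line may produce many words."""
    return [word for line in lines for word in line.split()]


def count_group(words):
    """Reduce function applied to one reduce-key group of identical words."""
    return [(words[0], len(words))]


lines = ["a b", "b c"]
words = split_words(lines)

print(map_provenance(split_words, lines, "b"))
print(reduce_provenance(count_group, words, lambda w: w, ("b", 2)))
```

Note that the map provenance of "b" contains both input lines, since each of them independently produces that output element.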
![Page 12: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/12.jpg)
Workflow Provenance
Intuitive recursive definition
Desirable “replay” property
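$o \in W(I_1^*, \dots, I_n^*)$
Usually holds, but not always
[Figure: workflow $W$ of Map and Reduce nodes with inputs $I_1, \dots, I_n$, intermediate data sets $E_1, E_2, \dots$, and output $O$; an output element $o \in O$ is traced back to provenance subsets $I_1^*, \dots, I_n^*$ of the inputs]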
![Page 13: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/13.jpg)
Replay Property Example
[Figure: workflow TwitterPosts → TweetScan (M) → Inferred Movie Ratings → Summarize (R) → Rating Medians → Count (R) → #Movies Per Rating]

TwitterPosts:
“Avatar was great”
“I hated Twilight”
“Twilight was pretty bad”
“I enjoyed Avatar”
“I loved Twilight”
“Avatar was okay”

Inferred Movie Ratings:
| Movie | Rating |
| --- | --- |
| Avatar | 8 |
| Twilight | 0 |
| Twilight | 2 |
| Avatar | 7 |
| Twilight | 7 |
| Avatar | 4 |

Rating Medians:
| Movie | Median |
| --- | --- |
| Avatar | 7 |
| Twilight | 2 |

#Movies Per Rating:
| Median | #Movies |
| --- | --- |
| 2 | 1 |
| 7 | 1 |
![Page 14: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/14.jpg)
Replay Property Example
![Page 15: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/15.jpg)
Replay Property Example
[Figure: workflow TwitterPosts → TweetScan (M) → Inferred Movie Ratings → Summarize (R) → Rating Medians → Count (R) → #Movies Per Rating]

TwitterPosts:
“Avatar was great”
“I hated Twilight”
“Twilight was pretty bad”
“I enjoyed Avatar and Twilight too”
“Avatar was okay”

Inferred Movie Ratings:
| Movie | Rating |
| --- | --- |
| Avatar | 8 |
| Twilight | 0 |
| Twilight | 2 |
| Avatar | 7 |
| Twilight | 7 |
| Avatar | 4 |

Rating Medians:
| Movie | Median |
| --- | --- |
| Avatar | 7 |
| Twilight | 2 |

#Movies Per Rating:
| Median | #Movies |
| --- | --- |
| 2 | 1 |
| 7 | 1 |
![Page 16: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/16.jpg)
Replay Property Example
![Page 17: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/17.jpg)
Replay Property Example
[Figure: workflow TwitterPosts → TweetScan (M) → Inferred Movie Ratings → Summarize (R) → Rating Medians → Count (R) → #Movies Per Rating]

TwitterPosts:
“Avatar was great”
“I hated Twilight”
“Twilight was pretty bad”
“I enjoyed Avatar and Twilight too”
“Avatar was okay”

Inferred Movie Ratings:
| Movie | Rating |
| --- | --- |
| Avatar | 8 |
| Twilight | 0 |
| Twilight | 2 |
| Avatar | 7 |
| Twilight | 7 |
| Avatar | 4 |

Rating Medians:
| Movie | Median |
| --- | --- |
| Avatar | 7 |
| Twilight | 2 |

#Movies Per Rating:
| Median | #Movies |
| --- | --- |
| 2 | 1 |
| 7 | 1 |

The row for median 7 is shown with both #Movies = 1 and #Movies = 2.
Callouts: One-Many Function; Nonmonotonic Reduce (shown twice)
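One way to read this example is to trace the output element (7, 1) back to its tweets and rerun the workflow on just those tweets. The sketch below works through that reading under stated assumptions: the `RATINGS` lookup stands in for whatever sentiment inference TweetScan performs, `median_low` stands in for the median computation, and `provenance_of` is a hypothetical brute-force tracer.

```python
from collections import defaultdict
from statistics import median_low

# Assumption: a fixed lookup replaces real sentiment inference.
RATINGS = {
    "Avatar was great": [("Avatar", 8)],
    "I hated Twilight": [("Twilight", 0)],
    "Twilight was pretty bad": [("Twilight", 2)],
    "I enjoyed Avatar and Twilight too": [("Avatar", 7), ("Twilight", 7)],
    "Avatar was okay": [("Avatar", 4)],
}
TWEETS = list(RATINGS)


def tweet_scan(tweets):
    """Map (one-many): each tweet yields one rating per movie it mentions."""
    return [rating for tweet in tweets for rating in RATINGS[tweet]]


def summarize(ratings):
    """Reduce on movie: median rating per movie."""
    by_movie = defaultdict(list)
    for movie, rating in ratings:
        by_movie[movie].append(rating)
    return [(movie, median_low(rs)) for movie, rs in by_movie.items()]


def count(medians):
    """Reduce on median: number of movies per median rating."""
    by_median = defaultdict(int)
    for _, med in medians:
        by_median[med] += 1
    return sorted(by_median.items())


def workflow(tweets):
    return count(summarize(tweet_scan(tweets)))


def provenance_of(output, tweets):
    """Trace an output (median, #movies) back through Count, Summarize, TweetScan."""
    target_median, _ = output
    movies = {m for m, med in summarize(tweet_scan(tweets)) if med == target_median}
    return [t for t in tweets if any(m in movies for m, _ in tweet_scan([t]))]


full_output = workflow(TWEETS)
traced = provenance_of((7, 1), TWEETS)
replayed = workflow(traced)
print(full_output, traced, replayed)
```

In this sketch the traced tweets include “I enjoyed Avatar and Twilight too”, which on replay also emits a Twilight rating of 7. Twilight's median then becomes 7 as well, so the replayed output no longer contains (7, 1).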
![Page 18: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/18.jpg)
Capturing and Tracing Provenance
Map functions: add the input ID to each of the output elements
Reduce functions: add the input reduce-key to each of the output elements
Tracing: straightforward recursive algorithms
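A minimal sketch of capture and recursive backward tracing might look like the following. It is a simplification: it records input IDs directly per output element instead of going through reduce-keys, and the ID scheme and function names (`capture_map`, `capture_reduce`, `trace_back`) are hypothetical.

```python
def capture_map(map_fn, inputs):
    """Run a map function per element; record the input ID for each output."""
    outputs, prov = {}, {}
    for in_id, elem in inputs.items():
        for n, out in enumerate(map_fn(elem)):
            out_id = f"{in_id}.{n}"  # hypothetical output ID scheme
            outputs[out_id] = out
            prov[out_id] = [in_id]
    return outputs, prov


def capture_reduce(reduce_fn, key_fn, inputs):
    """Group inputs on reduce-key; record the whole group for each output."""
    groups = {}
    for in_id, elem in inputs.items():
        groups.setdefault(key_fn(elem), []).append(in_id)
    outputs, prov = {}, {}
    for key, ids in groups.items():
        for n, out in enumerate(reduce_fn(key, [inputs[i] for i in ids])):
            out_id = f"{key}#{n}"
            outputs[out_id] = out
            prov[out_id] = ids
    return outputs, prov


def trace_back(out_id, prov_chain):
    """Recursively follow provenance from the last node back to the inputs."""
    if not prov_chain:
        return {out_id}
    *earlier, last = prov_chain
    result = set()
    for in_id in last[out_id]:
        result |= trace_back(in_id, earlier)
    return result


lines = {"l0": "a b", "l1": "b"}
words, p1 = capture_map(str.split, lines)
counts, p2 = capture_reduce(lambda key, values: [(key, len(values))],
                            lambda w: w, words)
print(counts)
print(trace_back("b#0", [p1, p2]))
```

Forward tracing would follow the same recursion with the provenance dictionaries inverted.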
![Page 19: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/19.jpg)
RAMP System
Built as an extension to Hadoop
Supports MapReduce workflows: each node is a MapReduce job
Provenance capture is transparent, retaining Hadoop’s parallel execution and fault tolerance
Users need not be aware of provenance capture: wrapping is automatic
RAMP stores provenance separately from the input and output data
![Page 20: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/20.jpg)
RAMP System: Provenance Capture
Hadoop components: record-reader, mapper, combiner (optional), reducer, record-writer
![Page 21: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/21.jpg)
RAMP System: Provenance Capture
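[Figure: map-side provenance capture. Without wrappers: Input → RecordReader → (ki, vi) → Mapper → (km, vm) → Map Output. With wrappers: the RecordReader wrapper emits (ki, 〈vi, p〉), where p is the provenance ID of the input element; the Mapper wrapper passes (ki, vi) to the Mapper and turns each resulting (km, vm) into (km, 〈vm, p〉).]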
![Page 22: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/22.jpg)
RAMP System: Provenance Capture
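[Figure: reduce-side provenance capture. Without wrappers: Map Output → (km, [vm1,…,vmn]) → Reducer → (ko, vo) → RecordWriter → Output. With wrappers: the Reducer wrapper receives (km, [〈vm1, p1〉,…,〈vmn, pn〉]), passes (km, [vm1,…,vmn]) to the Reducer, and turns each resulting (ko, vo) into (ko, 〈vo, kmID〉); the RecordWriter wrapper writes (ko, vo) to Output. The pairs (kmID, pj) and (q, kmID) are stored as provenance, where q identifies the output element.]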
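The wrapper scheme in the two capture figures can be imitated in plain Python to see how the pairs flow. This is a single-process sketch, not RAMP or Hadoop code: the function names, the use of record positions as p and q, and the integer kmID assignment are all assumptions made for illustration.

```python
def wrapped_record_reader(records):
    """Attach a provenance ID p (here: the record position) to each input value."""
    for p, (ki, vi) in enumerate(records):
        yield ki, (vi, p)


def wrapped_mapper(mapper, ki, tagged_value):
    """Strip p, call the unmodified mapper, then re-attach p to each output."""
    vi, p = tagged_value
    for km, vm in mapper(ki, vi):
        yield km, (vm, p)


def wrapped_reducer(reducer, km, tagged_values, km_id, reduce_prov):
    """Store (kmID, pj) pairs, call the unmodified reducer, tag outputs with kmID."""
    for _, p in tagged_values:
        reduce_prov.append((km_id, p))
    for ko, vo in reducer(km, [vm for vm, _ in tagged_values]):
        yield ko, (vo, km_id)


def wrapped_record_writer(tagged_outputs, output, output_prov):
    """Write plain (ko, vo) to the output; store (q, kmID) separately."""
    for q, (ko, (vo, km_id)) in enumerate(tagged_outputs):
        output.append((ko, vo))
        output_prov.append((q, km_id))


def run_job(records, mapper, reducer):
    shuffled = {}
    for ki, tagged in wrapped_record_reader(records):
        for km, tagged_vm in wrapped_mapper(mapper, ki, tagged):
            shuffled.setdefault(km, []).append(tagged_vm)
    output, reduce_prov, output_prov, tagged_outputs = [], [], [], []
    for km_id, (km, tagged_values) in enumerate(sorted(shuffled.items())):
        tagged_outputs.extend(
            wrapped_reducer(reducer, km, tagged_values, km_id, reduce_prov))
    wrapped_record_writer(tagged_outputs, output, output_prov)
    return output, reduce_prov, output_prov


def wc_mapper(ki, vi):
    for word in vi.split():
        yield word, 1


def wc_reducer(km, values):
    yield km, sum(values)


output, reduce_prov, output_prov = run_job([(0, "a b"), (1, "b")],
                                           wc_mapper, wc_reducer)
print(output, reduce_prov, output_prov)
```

The user-supplied `wc_mapper` and `wc_reducer` never see provenance values, which mirrors the point that capture is transparent. Backward tracing an output q joins `output_prov` with `reduce_prov` on kmID.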
![Page 23: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/23.jpg)
Experiments
51 large EC2 instances (Thank you, Amazon!)
Two MapReduce “workflows”:
Wordcount: many-one with large fan-in; input sizes 100, 300, 500 GB
Terasort: one-one; input sizes 93, 279, 466 GB
![Page 24: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/24.jpg)
Results: Wordcount
![Page 25: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/25.jpg)
Results: Terasort
![Page 26: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/26.jpg)
Summary of Results
Overhead of provenance capture:
Terasort: 20% time overhead, 21% space overhead
Wordcount: 76% time overhead; space overhead depends directly on fan-in
Backward tracing:
Terasort: 1.5 seconds for one element
Wordcount: time directly dependent on fan-in
![Page 27: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/27.jpg)
Future Work
RAMP: selective provenance capture; more efficient backward and forward tracing; indexing
General: incorporating SQL processing nodes
![Page 28: Provenance for Generalized Map and Reduce Workflows](https://reader035.vdocument.in/reader035/viewer/2022062304/56812e43550346895d93cf84/html5/thumbnails/28.jpg)
PANDA: A System for Provenance and Data
“stanford panda”