Example Generation for Data Flow Programs
![Page 1: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/1.jpg)
Chris Olston, Shubham Chopra, Utkarsh Srivastava
Generating Example Data For Dataflow Programs
Research
![Page 2: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/2.jpg)
Data Processing Renaissance
Lots of data (TBs/day at Yahoo!)
Lots of queries and programs to analyze that data
New data flow languages: Map-Reduce, Pig Latin, Dryad
Other data flow systems: Aurora, Tioga, River
![Page 3: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/3.jpg)
Example Dataflow Program
LOAD(user, url)
LOAD(url, pagerank)
TRANSFORM user, canonicalize(url)
JOIN on url
GROUP on user
TRANSFORM user, AVG(pagerank)
FILTER avgPR > 0.5
Find users that tend to visit high-pagerank pages
![Page 4: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/4.jpg)
Iterative Process
LOAD(user, url)
LOAD(url, pagerank)
TRANSFORM user, canonicalize(url)
JOIN on url
GROUP on user
TRANSFORM user, AVG(pagerank)
FILTER avgPR > 0.5
Bug in UDF canonicalize?
Joining on right attribute?
Everything being filtered out?
No Output
![Page 5: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/5.jpg)
How to do test runs?
• Run with real data
– Too inefficient (TBs of data)
• Create smaller data sets (e.g., by sampling)
– Empty results due to joins [Chaudhuri et al. 99] and selective filters
• Biased sampling for joins
– Indexes not always present
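To make the "empty results due to joins" point concrete, here is a minimal Python sketch. The table contents, the 1% sampling rate, and the helper names `sample` and `join_on_url` are all assumptions made for illustration; they are not taken from the talk.

```python
import random

def sample(rows, fraction, rng):
    """Keep each row independently with the given probability."""
    return [row for row in rows if rng.random() < fraction]

def join_on_url(visits, pageranks):
    """Equi-join (user, url) with (url, pagerank) on url."""
    rank_by_url = dict(pageranks)
    return [(user, url, rank_by_url[url])
            for user, url in visits if url in rank_by_url]

rng = random.Random(0)  # fixed seed, only for repeatability

# Hypothetical synthetic tables: 10,000 distinct urls, one visit per url.
visits = [(f"user{i}", f"www.site{i}.com") for i in range(10_000)]
pageranks = [(f"www.site{i}.com", (i % 100) / 100) for i in range(10_000)]

visits_sample = sample(visits, 0.01, rng)
pageranks_sample = sample(pageranks, 0.01, rng)
joined = join_on_url(visits_sample, pageranks_sample)

# With independent 1% samples of each side, only about 0.01 * 0.01 of the
# original join result is expected to survive (roughly 1 row out of 10,000).
print(len(visits_sample), len(pageranks_sample), len(joined))
```

Each side keeps about a hundred rows, yet the join of the two samples is expected to be nearly empty, so downstream operators would see little or no data.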
![Page 6: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/6.jpg)
Examples to Illustrate Program
LOAD(user, url)
LOAD(url, pagerank)
TRANSFORM user, canonicalize(url)
JOIN on url
GROUP on user
TRANSFORM user, AVG(pagerank)
FILTER avgPR > 0.5
Example records after each step:
LOAD (user, url): (Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)
TRANSFORM canonicalize: (Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)
LOAD (url, pagerank): (www.cnn.com, 0.9) (www.frogs.com, 0.3) (www.snails.com, 0.4)
JOIN: (Amy, www.cnn.com, 0.9) (Amy, www.frogs.com, 0.3) (Fred, www.snails.com, 0.4)
GROUP: ( Amy, (Amy, www.cnn.com, 0.9) (Amy, www.frogs.com, 0.3) ) ( Fred, (Fred, www.snails.com, 0.4) )
TRANSFORM AVG: (Amy, 0.6) (Fred, 0.4)
FILTER: (Amy, 0.6)
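The example records on this slide can be reproduced with a small Python sketch of the same pipeline. The body of `canonicalize` is a hypothetical stand-in for the UDF (drop the scheme and path, ensure a "www." prefix), chosen only so that it matches the three records shown; the real UDF is not given in the talk.

```python
from collections import defaultdict

def canonicalize(url):
    """Hypothetical stand-in for the UDF: drop scheme and path, ensure 'www.'."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host

visits = [("Amy", "cnn.com"), ("Amy", "http://www.frogs.com"),
          ("Fred", "www.snails.com/index.html")]
pageranks = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3),
             ("www.snails.com", 0.4)]

# TRANSFORM user, canonicalize(url)
clean = [(user, canonicalize(url)) for user, url in visits]

# JOIN on url
rank_by_url = dict(pageranks)
joined = [(user, url, rank_by_url[url])
          for user, url in clean if url in rank_by_url]

# GROUP on user
groups = defaultdict(list)
for user, url, rank in joined:
    groups[user].append((user, url, rank))

# TRANSFORM user, AVG(pagerank); rounded to sidestep floating-point noise
averages = [(user, round(sum(r for _, _, r in rows) / len(rows), 2))
            for user, rows in groups.items()]

# FILTER avgPR > 0.5
result = [(user, avg) for user, avg in averages if avg > 0.5]
print(result)  # [('Amy', 0.6)]
```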
![Page 7: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/7.jpg)
Value Addition From Examples
• Examples can be used for
– Debugging
– Understanding a program written by someone else
– Learning a new operator, or language
![Page 8: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/8.jpg)
Outline
• Formalization of good examples
• Example Generation Algorithm
• Performance Evaluation
![Page 9: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/9.jpg)
Good Examples: Consistency
LOAD(user, url)
LOAD(url, pagerank)
TRANSFORM user, canonicalize(url)
JOIN on url
GROUP on user
TRANSFORM user, AVG(pagerank)
FILTER avgPR > 0.5
0. Consistency
output example = operator applied on input example
(Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)
(Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)
![Page 10: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/10.jpg)
Good Examples: Realism
LOAD(user, url)
LOAD(url, pagerank)
TRANSFORM user, canonicalize(url)
JOIN on url
GROUP on user
TRANSFORM user, AVG(pagerank)
FILTER avgPR > 0.5
1. Realism
Formalization: Fraction of examples that are real or are derived from real records
(Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)
(Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)
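The realism measure can be sketched in a few lines of Python. The bookkeeping (a boolean flag per example), the synthetic record, and the choice to return 0.0 for an empty example set are assumptions made for illustration.

```python
def realism(examples):
    """Fraction of example records that are real or derived from real records."""
    if not examples:
        return 0.0  # assumption: treat an empty example set as having no realism
    return sum(1 for _, is_real in examples if is_real) / len(examples)

# Hypothetical bookkeeping: each example is (record, is_real).
examples = [
    (("Amy", "cnn.com"), True),
    (("Fred", "www.snails.com/index.html"), True),
    (("Bob", "www.example.com"), False),  # synthetic record, invented here
]
print(realism(examples))  # two of the three examples are real
```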
![Page 11: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/11.jpg)
Good Examples: Completeness
LOAD(user, url)
LOAD(url, pagerank)
TRANSFORM user, canonicalize(url)
JOIN on url
GROUP on user
TRANSFORM user, AVG(pagerank)
FILTER avgPR > 0.5
2. Completeness
Demonstrate the salient properties of each operator, e.g., FILTER
(Amy, 0.6) (Fred, 0.4)
(Amy, 0.6)
![Page 12: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/12.jpg)
Good Examples: Completeness
LOAD(user, url)
LOAD(url, pagerank)
TRANSFORM user, canonicalize(url)
JOIN on url
GROUP on user
TRANSFORM user, AVG(pagerank)
FILTER avgPR > 0.5
2. Completeness
Demonstrate the salient properties of each operator, e.g., JOIN
(Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)
(www.cnn.com, 0.9) (www.frogs.com, 0.3) (www.snails.com, 0.4)
(Amy, www.cnn.com, 0.9) (Amy, www.frogs.com, 0.3) (Fred, www.snails.com, 0.4)
![Page 13: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/13.jpg)
Formalizing Completeness
• For any operator, classify input/output example records into equivalence classes.
• Each equivalence class demonstrates one property of the operator.
• Try to have at least one example from each class
![Page 14: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/14.jpg)
Equivalence Class Examples
FILTER
E0: All input records that pass the filter
E1: All input records that fail the filter
JOIN
E0: All output records
UNION
E0: All records belonging to first input
E1: All records belonging to second input
![Page 15: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/15.jpg)
Formalizing Completeness
Operator Completeness: Fraction of equivalence classes that have at least one example record.
Overall Completeness: Average of per-operator completeness.
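A minimal Python sketch of these two definitions, using the FILTER equivalence classes (pass, fail) from the earlier slide. The function names and the dictionary representation of classes are assumptions made for illustration.

```python
def filter_classes(records, predicate):
    """FILTER: E0 = input records that pass, E1 = input records that fail."""
    return {
        "E0": [r for r in records if predicate(r)],
        "E1": [r for r in records if not predicate(r)],
    }

def operator_completeness(classes):
    """Fraction of equivalence classes with at least one example record."""
    return sum(1 for members in classes.values() if members) / len(classes)

def overall_completeness(per_operator):
    """Average of the per-operator completeness values."""
    return sum(per_operator) / len(per_operator)

def passes(record):
    return record[1] > 0.5  # FILTER avgPR > 0.5

both = filter_classes([("Amy", 0.6), ("Fred", 0.4)], passes)
only_pass = filter_classes([("Amy", 0.6)], passes)

print(operator_completeness(both))       # 1.0
print(operator_completeness(only_pass))  # 0.5
print(overall_completeness([1.0, 0.5]))  # 0.75
```

Dropping (Fred, 0.4) leaves the "fails the filter" class empty, which halves the completeness of the FILTER operator.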
![Page 16: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/16.jpg)
Good Examples: Conciseness
LOAD(user, url)
LOAD(url, pagerank)
TRANSFORM user, canonicalize(url)
JOIN on url
GROUP on user
TRANSFORM user, AVG(pagerank)
FILTER avgPR > 0.5
3. Conciseness
Operator Conciseness: (# equivalence classes) / (# example records)
Overall Conciseness: Average of per-operator conciseness
(Amy, cnn.com) (Amy, http://www.frogs.com) (Fred, www.snails.com/index.html)
(Amy, www.cnn.com) (Amy, www.frogs.com) (Fred, www.snails.com)
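Reading the slide's fraction as the number of equivalence classes divided by the number of example records, conciseness can be sketched in Python as follows. The function names and the error raised for zero records are assumptions made for illustration.

```python
def operator_conciseness(num_equivalence_classes, num_example_records):
    """Ratio of equivalence classes to example records for one operator."""
    if num_example_records == 0:
        raise ValueError("conciseness is undefined without example records")
    return num_equivalence_classes / num_example_records

def overall_conciseness(per_operator):
    """Average of the per-operator conciseness values."""
    return sum(per_operator) / len(per_operator)

# FILTER has two equivalence classes (pass, fail).
print(operator_conciseness(2, 2))        # 1.0: one record per class
print(operator_conciseness(2, 8))        # 0.25: many redundant records
print(overall_conciseness([1.0, 0.25]))  # 0.625
```

Under this reading, adding redundant records that fall into already-covered classes lowers conciseness without improving completeness.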
![Page 17: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/17.jpg)
Outline
• Formalization of good examples
• Example Generation Algorithm
• Performance Evaluation
![Page 18: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/18.jpg)
Related Work
• Related Areas
– Reverse Query Processing
– Database Testing
– Software and Hardware Verification
• Differences
– Realism not a concern
– Notion of conciseness is different
– Intermediate result size is immaterial
![Page 19: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/19.jpg)
Strawman I: Downstream Propagation
Take some portion of input data and run the program over it.
1. Realism
2. Completeness
3. Conciseness
![Page 20: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/20.jpg)
Strawman II: Upstream Propagation
Start from what output is desired, and work backwards
1. Realism
2. Completeness
3. Conciseness
![Page 21: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/21.jpg)
Our Algorithm
Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
![Page 22: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/22.jpg)
Our Algorithm
Take a subset of input and propagate through the program.
LOAD(user, age)
FILTER udf(user)
LOAD(user, age)
UNION
FILTER age > 18
(Amy, 20) (Fred, 25)
(Amy, 20) (Fred, 25)
(Jack, 30)
(Amy, 20) (Fred, 25) (Jack, 30)
(Amy, 20) (Fred, 25) (Jack, 30)
Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
![Page 23: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/23.jpg)
Our Algorithm
LOAD(user, age)
FILTER udf(user)
LOAD(user, age)
UNION
FILTER age > 18
(Amy, 20) (Fred, 25)
(Amy, 20) (Fred, 25)
(Jack, 30)
(Amy, 20) (Fred, 25) (Jack, 30)
(Amy, 20) (Fred, 25) (Jack, 30)
Prune redundant examples, i.e., improve conciseness without hurting completeness.
Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
![Page 24: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/24.jpg)
Our Algorithm
LOAD(user, age)
FILTER udf(user)
LOAD(user, age)
UNION
FILTER age > 18
(Amy, 20) (Fred, 25)
(Jack, 30)
(Amy, 20) (Fred, 25) (Jack, 30)
(Amy, 20) (Fred, 25)
(Amy, 20) (Fred, 25) (Jack, 30)
Prune redundant examples, i.e., improve conciseness without hurting completeness.
Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
![Page 25: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/25.jpg)
Our Algorithm
LOAD(user, age)
FILTER udf(user)
LOAD(user, age)
UNION
FILTER age > 18
(Amy, 20)
(Jack, 30)
(Amy, 20) (Amy, 20) (Jack, 30)
(Amy, 20) (Jack, 30)
Prune redundant examples, i.e., improve conciseness without hurting completeness.
Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
![Page 26: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/26.jpg)
Formalization of Pruning
Example Records ↔ Elements
Equivalence Classes ↔ Sets
Pick the minimum number of records that covers every equivalence class: Set-Cover Problem
• More involved because completeness of other operators must be maintained; details in paper
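One common way to approximate a covering problem like this is a greedy heuristic. The Python sketch below is not claimed to be the algorithm from the paper: it ignores the requirement that completeness of other operators be maintained, and the class labels (`udf:pass`, `union:first`, and so on) are hypothetical names invented for illustration.

```python
def greedy_prune(record_classes):
    """Greedily keep records until every equivalence class is covered.

    record_classes maps each record to the set of classes it belongs to.
    """
    uncovered = set().union(*record_classes.values()) if record_classes else set()
    kept = []
    while uncovered:
        # Pick the record covering the most still-uncovered classes.
        best = max(record_classes,
                   key=lambda rec: len(record_classes[rec] & uncovered))
        kept.append(best)
        uncovered -= record_classes[best]
    return kept

# Hypothetical class labels for the FILTER / UNION / FILTER example.
record_classes = {
    ("Amy", 20): {"udf:pass", "union:first", "age:pass"},
    ("Fred", 25): {"udf:pass", "union:first", "age:pass"},
    ("Jack", 30): {"union:second", "age:pass"},
}
print(greedy_prune(record_classes))  # [('Amy', 20), ('Jack', 30)]
```

In this toy input (Fred, 25) covers nothing that (Amy, 20) does not, so it is pruned, which matches the records left after the pruning pass in the slides.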
![Page 27: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/27.jpg)
Our Algorithm
LOAD(user, age)
FILTER udf(user)
LOAD(user, age)
UNION
FILTER age > 18
(Amy, 20)
(Jack, 30)
(Amy, 20) (Amy, 20) (Jack, 30)
(Amy, 20) (Jack, 30)
Enhance completeness by inserting constraint records (best effort; details in paper)
Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
![Page 28: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/28.jpg)
Our Algorithm
LOAD(user, age)
FILTER udf(user)
LOAD(user, age)
UNION
FILTER age > 18
(Amy, 20)
(Jack, 30)
(Amy, 20) (Amy, 20) (Jack, 30)
(Amy, 20) (Jack, 30)
(--, 17)
Enhance completeness by inserting constraint records (best effort; details in paper)
Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
![Page 29: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/29.jpg)
Our Algorithm
LOAD(user, age)
FILTER udf(user)
LOAD(user, age)
UNION
FILTER age > 18
(Amy, 20) (Bill, 17)
(Jack, 30) (--, 17)
(Amy, 20) (--, 17)
(Amy, 20) (Jack, 30)
(Amy, 20) (Jack, 30) (--, 17)
Enhance completeness by inserting constraint records (best effort; details in paper)
Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
![Page 30: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/30.jpg)
Our Algorithm
LOAD(user, age)
FILTER udf(user)
LOAD(user, age)
UNION
FILTER age > 18
(Amy, 20) (Bill, 17)
(Jack, 30) (Bob, 17)
(Amy, 20) (Bill, 17)
(Amy, 20) (Jack, 30)
(Amy, 20) (Jack, 30) (Bill, 17) (Bob, 17)
Enhance completeness by inserting constraint records (best effort; details in paper)
Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
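The idea of a constraint record for FILTER age > 18 can be sketched in Python as below. This is only a rough illustration under stated assumptions, not the procedure from the paper: the helper names are hypothetical, `None` stands in for the slide's "--" (don't-care) field, and the way the don't-care field is later filled with a concrete name is simplified to a caller-supplied value.

```python
def synthesize_failing_record(examples, threshold=18):
    """Return a constraint record that fails `age > threshold`.

    Returns None when some example already fails the filter, since the
    "fails the filter" equivalence class is then already covered.
    """
    if any(age <= threshold for _, age in examples):
        return None
    return (None, threshold - 1)  # None plays the role of "--"

def fill_dont_care(record, placeholder_user):
    """Replace a don't-care user field with a concrete (hypothetical) value."""
    user, age = record
    return (placeholder_user if user is None else user, age)

examples = [("Amy", 20), ("Jack", 30)]
constraint = synthesize_failing_record(examples)
print(constraint)                          # (None, 17)
print(fill_dont_care(constraint, "Bill"))  # ('Bill', 17)
```

Once a failing record such as (Bill, 17) is present, both FILTER classes have an example, so no further constraint record is synthesized for that operator.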
![Page 31: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/31.jpg)
Our Algorithm
LOAD(user, age)
FILTER udf(user)
LOAD(user, age)
UNION
FILTER age > 18
(Amy, 20) (Bill, 17)
(Jack, 30) (Bob, 17)
(Amy, 20) (Bill, 17)
(Amy, 20) (Jack, 30)
(Amy, 20) (Jack, 30) (Bill, 17) (Bob, 17)
Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones.
Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
![Page 32: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/32.jpg)
Our Algorithm
LOAD(user, age)
FILTER udf(user)
LOAD(user, age)
UNION
FILTER age > 18
(Amy, 20) (Bill, 17)
(Jack, 30) (Bob, 17)
(Amy, 20) (Bill, 17)
(Amy, 20) (Jack, 30)
(Amy, 20) (Jack, 30) (Bill, 17) (Bob, 17)
Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones.
Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
![Page 33: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/33.jpg)
Our Algorithm
LOAD(user, age)
FILTER udf(user)
LOAD(user, age)
UNION
FILTER age > 18
(Bill, 17)
(Jack, 30)
(Bill, 17) (Jack, 30) (Jack, 30)
(Bill, 17)
Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones.
Algorithm Passes: 1. Downstream 2. Pruning 3. Upstream 4. Pruning
![Page 34: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/34.jpg)
Implementation Status
• Available as ILLUSTRATE command in open-source release of Pig
• Available as Eclipse Plugin (PigPen)
![Page 35: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/35.jpg)
PigPen Snapshot
![Page 36: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/36.jpg)
Performance Evaluation
Program I: (Web Search Result Viewing Statistics)
– LOAD
– FILTER by compound arithmetic expression
– GROUP
– TRANSFORM using built-in aggregate function
![Page 37: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/37.jpg)
Performance on Program I
[Chart: realism, conciseness, and completeness (scale 0 to 1) for downstream, upstream, and our algorithm]
![Page 38: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/38.jpg)
Performance Evaluation
Program II: (Web Advertising Activity)
– LOAD table A
– FILTER A by compound logical expression
– JOIN with table B (highly selective)
– TRANSFORM using 4 string manipulation UDFs (non-invertible)
![Page 39: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/39.jpg)
Performance on Program II
[Chart: realism, conciseness, and completeness (scale 0 to 1) for downstream, upstream, and our algorithm]
![Page 40: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/40.jpg)
Running Time
[Chart: running time in seconds (scale 0 to 4) on programs P1 to P8 for downstream, upstream, and our algorithm]
![Page 41: Example Generation for Data Flow Programs](https://reader036.vdocument.in/reader036/viewer/2022081507/5538c6b04a7959b26f8b4864/html5/thumbnails/41.jpg)
Conclusions
• Writing dataflow programs is an iterative process.
• Actual dataset too large for test runs.
• Our algorithm can automatically generate examples that illustrate the program through:
– Realism
– Conciseness
– Completeness