BigDansing presentation slides for KAUST

Upload: zuhair-khayyat

Post on 21-Jan-2018

TRANSCRIPT

Page 1: BigDansing presentation slides for KAUST

BigDansing: A BigData Cleansing System

By: Zuhair Khayyat

InfoCloud group, Computer, Electrical and Mathematical Sciences and Engineering Division

King Abdullah University of Science and Technology (KAUST)

Page 4: BigDansing presentation slides for KAUST

4

Example of a Dirty Dataset

● Company employee database:

– Business rule: Any two employees in same Zipcode must be in same City.

Name Zipcode City State Salary Rate

t1 Annie 10001 NY NY 24000 15

t2 Laure 90210 LA CA 25000 10

t3 John 60601 CH IL 40000 25

t4 Mark 90210 SF CA 88000 28

t5 Robert 60827 CH IL 15000 15

t6 Mary 90210 LA CA 81000 28

t7 Jon 60601 CH IL 40000 25

Page 6: BigDansing presentation slides for KAUST

6

Example of a Dirty Dataset

● Company employee database:

– Business rule: Any two employees in same Zipcode must be in same City.

Name Zipcode City State Salary Rate

t1 Annie 10001 NY NY 24000 15

t2 Laure 90210 LA CA 25000 10

t3 John 60601 CH IL 40000 25

t4 Mark 90210 LA CA 88000 28

t5 Robert 60827 CH IL 15000 15

t6 Mary 90210 LA CA 81000 28

t7 Jon 60601 CH IL 40000 25

Page 7: BigDansing presentation slides for KAUST

7

Is Dirty Data a Real Problem?

● “Software expert Hollis Tibbets, the Global Director of Marketing at Dell, estimates that duplicate data and bad data combined cost the U.S. economy over $3 trillion every year”

● “duplicate and dirty data costs the healthcare industry over $300 billion every year.”

● Lost revenue, data repair costs.

– By: Joe Fusaro in www.ringlead.com/blog/dirty-data-costs-economy-3-trillion

Page 8: BigDansing presentation slides for KAUST

8

Is Dirty Data a Real Problem?

● “New research from Experian Data Quality shows that inaccurate data has a direct impact on the bottom line of 88% of companies, with the average company losing 12% of its revenue”

– By: Ben Davis in https://econsultancy.com/blog/64612-the-cost-of-bad-data-stats/

Page 9: BigDansing presentation slides for KAUST

9

What is Data Cleansing?

● To detect and correct corrupt or inaccurate records from a record set, table, or database.

● 25% of the world's critical data is dirty:

– Typos, duplicates, outdated data, missing values

● Dirty data sources:

– Data entry errors

– Data update errors

– Data transmission errors

– Bugs in data processing tools

Page 12: BigDansing presentation slides for KAUST

12

How to Detect Dirty Data?

● Dirty data is detected by declarative rules:

– A formal way to express dirty data.

– Functional dependencies (FD):

● A constraint between two sets of attributes in a relation

● Example: Zipcode → City

– Conditional functional dependencies (CFD):

● Example: Country = 'Saudi Arabia', Zipcode → City

Page 13: BigDansing presentation slides for KAUST

13

How to Detect Dirty Data?

● Dirty data is detected by declarative rules:

– A formal way to express dirty data.

– Functional dependencies (FD),

– Conditional functional dependency (CFD),

– Denial constraints (DC):

● A set of Boolean conditions that must not all be satisfied together

● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)

– There cannot exist two tuples in relation D where t1's salary is greater than t2's salary and t1's rate is less than t2's rate.

Page 14: BigDansing presentation slides for KAUST

14

How to Detect Dirty Data?

● Dirty data is detected by declarative rules:

– A formal way to express dirty data.

– Functional dependencies (FD)

– Conditional functional dependency (CFD),

– Denial constraints (DC).

● Or by user-defined functions (UDFs):

– Duplicates, statistical errors.

Page 15: BigDansing presentation slides for KAUST

15

Example of a Dirty Dataset

Name Zipcode City State Salary Rate

t1 Annie 10001 NY NY 24000 15

t2 Laure 90210 LA CA 25000 10

t3 John 60601 CH IL 40000 25

t4 Mark 90210 SF CA 88000 28

t5 Robert 60827 CH IL 15000 15

t6 Mary 90210 LA CA 81000 28

t7 Jon 60601 CH IL 40000 25

● FD: Zipcode → City

● Two tuples sharing the same Zipcode must have the same City name.
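
A minimal Python sketch of checking this FD on the table above. It is an illustration only, not BigDansing code; the function name `fd_violations` and the tuple layout are assumptions.

```python
from collections import defaultdict

# Employee rows from the slide: (id, name, zipcode, city)
rows = [
    ("t1", "Annie", "10001", "NY"),
    ("t2", "Laure", "90210", "LA"),
    ("t3", "John", "60601", "CH"),
    ("t4", "Mark", "90210", "SF"),
    ("t5", "Robert", "60827", "CH"),
    ("t6", "Mary", "90210", "LA"),
    ("t7", "Jon", "60601", "CH"),
]

def fd_violations(rows):
    """Return {zipcode: sorted cities} for every zipcode mapped to more than one city."""
    cities_by_zip = defaultdict(set)
    for _tid, _name, zipcode, city in rows:
        cities_by_zip[zipcode].add(city)
    return {z: sorted(c) for z, c in cities_by_zip.items() if len(c) > 1}

violations = fd_violations(rows)
print(violations)  # only zipcode 90210 maps to two cities (LA and SF)
```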

Page 16: BigDansing presentation slides for KAUST

16

Example of a Dirty Dataset

Name Zipcode City State Salary Rate

t1 Annie 10001 NY NY 24000 15

t2 Laure 90210 LA CA 25000 10

t3 John 60601 CH IL 40000 25

t4 Mark 90210 SF CA 88000 28

t5 Robert 60827 CH IL 15000 15

t6 Mary 90210 LA CA 81000 28

t7 Jon 60601 CH IL 40000 25

● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)

● There cannot exist two tuples in relation D where t1's salary is greater than t2's salary and t1's rate is less than t2's rate.
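
A naive Python sketch of this check, comparing every ordered pair of tuples from the table above. The name `dc_violations` and the dictionary layout are assumptions made for illustration.

```python
from itertools import permutations

# (Salary, Rate) per tuple, copied from the slide
rows = {
    "t1": (24000, 15),
    "t2": (25000, 10),
    "t3": (40000, 25),
    "t4": (88000, 28),
    "t5": (15000, 15),
    "t6": (81000, 28),
    "t7": (40000, 25),
}

def dc_violations(rows):
    """Return ordered pairs (a, b) with a.Salary > b.Salary and a.Rate < b.Rate."""
    return [
        (a, b)
        for a, b in permutations(rows, 2)
        if rows[a][0] > rows[b][0] and rows[a][1] < rows[b][1]
    ]

violations = dc_violations(rows)
print(violations)  # t2 earns more than t1 and t5 but has a lower rate
```

The quadratic number of pairs is what makes this rule expensive on large data.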

Page 17: BigDansing presentation slides for KAUST

17

Example of a Dirty Dataset

Name Zipcode City State Salary Rate

t1 Annie 10001 NY NY 24000 15

t2 Laure 90210 LA CA 25000 10

t3 John 60601 CH IL 40000 25

t4 Mark 90210 SF CA 88000 28

t5 Robert 60827 CH IL 15000 15

t6 Mary 90210 LA CA 81000 28

t7 Jon 60601 CH IL 40000 25

● DC: ∀ t1, t2 ∈ D, ¬(simF(t1.Name, t2.Name) ˄ t1.City = t2.City)

● There cannot exist two tuples in relation D that have similar names and the same city.
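
A hedged Python sketch of this duplicate rule. The slides do not say how simF is defined, so the sketch assumes a character-level similarity from the standard library's `difflib.SequenceMatcher` with an arbitrary threshold of 0.8.

```python
from difflib import SequenceMatcher
from itertools import combinations

# (id, name, city) from the slide
rows = [
    ("t1", "Annie", "NY"), ("t2", "Laure", "LA"), ("t3", "John", "CH"),
    ("t4", "Mark", "SF"), ("t5", "Robert", "CH"), ("t6", "Mary", "LA"),
    ("t7", "Jon", "CH"),
]

def sim_f(a, b, threshold=0.8):
    """Assumed similarity test: SequenceMatcher ratio at or above the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def duplicate_candidates(rows):
    """Return id pairs whose names are similar and whose cities are equal."""
    return [
        (x[0], y[0])
        for x, y in combinations(rows, 2)
        if x[2] == y[2] and sim_f(x[1], y[1])
    ]

duplicates = duplicate_candidates(rows)
print(duplicates)  # John (t3) and Jon (t7) both live in CH
```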

Page 21: BigDansing presentation slides for KAUST

21

Data Cleansing Process

Dirty data + Quality rules → Detect → Repair → Update → Clean data

● Detect: 90% of runtime

● Repair: 90% of research

● 10% errors max

Page 22: BigDansing presentation slides for KAUST

22

Data Cleansing Process

Dirty data + Quality rules → Detect → Repair → Update → Clean data

● Detect (90% of runtime). Researchers use:

– Naive code

– DBMS

● Repair (90% of research). Researchers target:

– Better quality

– Fewer iterations, to reduce runtime

● 10% errors max

Page 23: BigDansing presentation slides for KAUST

23

A Big Data Challenge

● How to run data cleansing on large datasets?

● How to support known declarative rules and possible UDFs?

Page 24: BigDansing presentation slides for KAUST

24

BigDansing architecture

Page 26: BigDansing presentation slides for KAUST

26

BigDansing

✔ Generic abstraction:

– Supports rule-based detection: FD, CFD, DC

– Supports UDF-based detection

– Easy to use, automatic parallelization

– Separates logical from physical operators:

● System independent

● Provides multiple physical optimizations

✔ Fast and scalable detection, repair and updates:

– 1.9B rows → 13B violations in under 3 hours on 16 small machines

– Related work handles at most 1M rows on a single machine

Page 27: BigDansing presentation slides for KAUST

27

Abstraction

Page 28: BigDansing presentation slides for KAUST

28

BigDansing Semantics

● Input data set represented by a set of data units

● Each data unit “U”:

– A single row in relational data

– A single triple in RDF data

– A single article in Wikipedia

Page 29: BigDansing presentation slides for KAUST

29

Logical Operators

● Quality rules are represented by 5 functions.

● BigDansing automatically translates declarative rules into logical operators.

● Users are free to implement their logic using logical operators.

● Fundamental operators (the minimum needed to represent a large set of data quality rules): Detect, GenFix

● Optional operators: Scope, Block, Iterate

[Diagram: Data Cleansing Job, Database]

Page 30: BigDansing presentation slides for KAUST

30

BigDansing Semantics - Scope

● Scope:

– Input: data units

– Output: data units

● Example: Zipcode → City

– Input:

● t1 – t6

– Output:

● t1 – t6 (Zipcode, City)

Name Zipcode City

t1 Annie 10001 NY

t2 Laure 90210 LA

t3 John 60601 CH

t4 Mark 90210 SF

t5 Robert 60827 CH

t6 Mary 90210 LA

Zipcode City

t1 10001 NY

t2 90210 LA

t3 60601 CH

t4 90210 SF

t5 60827 CH

t6 90210 LA
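
A small Python sketch of what a Scope step does on the rows above. The function name `scope` and the dictionary layout are illustrative assumptions, not the BigDansing API.

```python
# Input data units from the slide
rows = [
    {"id": "t1", "Name": "Annie", "Zipcode": "10001", "City": "NY"},
    {"id": "t2", "Name": "Laure", "Zipcode": "90210", "City": "LA"},
    {"id": "t3", "Name": "John", "Zipcode": "60601", "City": "CH"},
    {"id": "t4", "Name": "Mark", "Zipcode": "90210", "City": "SF"},
    {"id": "t5", "Name": "Robert", "Zipcode": "60827", "City": "CH"},
    {"id": "t6", "Name": "Mary", "Zipcode": "90210", "City": "LA"},
]

def scope(unit, attributes=("Zipcode", "City")):
    """Keep only the attributes the rule needs, plus the unit id."""
    return {"id": unit["id"], **{a: unit[a] for a in attributes}}

scoped = [scope(u) for u in rows]
print(scoped[3])  # t4 without its Name attribute
```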

Page 31: BigDansing presentation slides for KAUST

31

BigDansing Semantics - Block

● Block:

– Input: data unit

– Output: grouping key

● Example: Zipcode → City

– Input:

● t1 – t6

– Output:

● <10001, t1>, <90210, (t2,t4,t6)>, <60601, t3>, <60827, t5>

Zipcode City

t1 10001 NY

t2 90210 LA

t3 60601 CH

t4 90210 SF

t5 60827 CH

t6 90210 LA
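
A sketch of the Block step in Python: each data unit is mapped to a grouping key, and units with the same key land in the same block. The helper name `block` is an assumption.

```python
from collections import defaultdict

# Scoped data units from the slide: (id, zipcode, city)
scoped = [
    ("t1", "10001", "NY"), ("t2", "90210", "LA"), ("t3", "60601", "CH"),
    ("t4", "90210", "SF"), ("t5", "60827", "CH"), ("t6", "90210", "LA"),
]

def block(unit):
    """Return the grouping key of a data unit (here, its Zipcode)."""
    return unit[1]

groups = defaultdict(list)
for unit in scoped:
    groups[block(unit)].append(unit[0])

print(dict(groups))
```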

Page 32: BigDansing presentation slides for KAUST

32

BigDansing Semantics - Iterate

● Iterate:

– Input: a group of data units

– Output: single tuple, tuple pair

● Example: Zipcode → City

– Input:

● <10001, t1>, <90210, (t2,t4,t6)>, <60601, t3>, <60827, t5>

– Output:

● <t2,t4>, <t2,t6>, <t4,t6>

Zipcode City

t1 10001 NY

t2 90210 LA

t3 60601 CH

t4 90210 SF

t5 60827 CH

t6 90210 LA
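
A sketch of the Iterate step in Python, assuming it enumerates every unordered pair inside a block. The function name `iterate` is illustrative.

```python
from itertools import combinations

# Blocks produced by the previous step: zipcode -> tuple ids
groups = {
    "10001": ["t1"],
    "90210": ["t2", "t4", "t6"],
    "60601": ["t3"],
    "60827": ["t5"],
}

def iterate(ids):
    """Enumerate all unordered pairs of data units within one block."""
    return list(combinations(ids, 2))

pairs = [pair for ids in groups.values() for pair in iterate(ids)]
print(pairs)  # single-unit blocks produce no pairs
```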

Page 33: BigDansing presentation slides for KAUST

33

BigDansing Semantics - Detect

● Detect:

– Input: data units

– Output: Violation(s)

● Example: Zipcode → City

– Input:

● <t2,t4>, <t2,t6>, <t4,t6>

– Output:

● (t2.City ≠ t4.City), (t4.City ≠ t6.City)

Zipcode City

t1 10001 NY

t2 90210 LA

t3 60601 CH

t4 90210 SF

t5 60827 CH

t6 90210 LA
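
A sketch of the Detect step in Python for this FD: a pair is a violation when the two cities differ. The triple used to represent a violation is an assumption made for illustration.

```python
# City per tuple and candidate pairs from the Iterate step
city = {"t2": "LA", "t4": "SF", "t6": "LA"}
pairs = [("t2", "t4"), ("t2", "t6"), ("t4", "t6")]

def detect(pair):
    """Return a violation triple if the pair breaks Zipcode → City, else None."""
    a, b = pair
    if city[a] != city[b]:
        return (f"{a}.City", "≠", f"{b}.City")
    return None

violations = [v for v in map(detect, pairs) if v is not None]
print(violations)
```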

Page 34: BigDansing presentation slides for KAUST

34

Semantics - GenFix

● GenFix:

– Input: Violation

– Output: possible fix(es)

● Example: Zipcode → City

– Input:

● (t2.City ≠ t4.City), (t4.City ≠ t6.City)

– Output:

● (t2.City = t4.City), (t4.City = t6.City)

Zipcode City

t1 10001 NY

t2 90210 LA

t3 60601 CH

t4 90210 SF

t5 60827 CH

t6 90210 LA
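
A sketch of the GenFix step in Python: each violation is turned into a candidate fix that states what must hold after repair. It does not pick a value; that is left to the repair algorithm. The function name `gen_fix` is an assumption.

```python
# Violations produced by the Detect step
violations = [("t2.City", "≠", "t4.City"), ("t4.City", "≠", "t6.City")]

def gen_fix(violation):
    """Turn 'left ≠ right' into the possible fix 'left = right'."""
    left, _op, right = violation
    return (left, "=", right)

fixes = [gen_fix(v) for v in violations]
print(fixes)
```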

Page 35: BigDansing presentation slides for KAUST

35

Logical Planning

Page 36: BigDansing presentation slides for KAUST

36

Logical Planning

● A logical plan defines the flow of data units.

● Validating the plan:

– At least one input dataset

– For UDFs: at least one Detect

– For rules: at least one rule

● Supports simple and bushy plans

Page 37: BigDansing presentation slides for KAUST

37

Logical Planning – FD example

● FD: Zipcode → City

● Operators:

– Scope(Zipcode,City)

– Block(Zipcode)

– Iterate(n²)

– Detect(tx.City ≠ ty.City)

– GenFix(tx.City = ty.City)

Dataset → Scope → Block → Iterate → Detect → GenFix
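
The five operators can be chained into one sequential Python sketch. This is only a single-machine illustration of the data flow, assuming Iterate enumerates all pairs within a block; `run_fd_plan` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations

rows = [
    {"id": "t1", "Name": "Annie", "Zipcode": "10001", "City": "NY"},
    {"id": "t2", "Name": "Laure", "Zipcode": "90210", "City": "LA"},
    {"id": "t3", "Name": "John", "Zipcode": "60601", "City": "CH"},
    {"id": "t4", "Name": "Mark", "Zipcode": "90210", "City": "SF"},
    {"id": "t5", "Name": "Robert", "Zipcode": "60827", "City": "CH"},
    {"id": "t6", "Name": "Mary", "Zipcode": "90210", "City": "LA"},
    {"id": "t7", "Name": "Jon", "Zipcode": "60601", "City": "CH"},
]

def run_fd_plan(rows, lhs="Zipcode", rhs="City"):
    """Scope → Block → Iterate → Detect → GenFix for the FD lhs → rhs."""
    # Scope: keep only the attributes the rule needs
    scoped = [{"id": r["id"], lhs: r[lhs], rhs: r[rhs]} for r in rows]
    # Block: group data units by the left-hand side
    blocks = defaultdict(list)
    for unit in scoped:
        blocks[unit[lhs]].append(unit)
    fixes = []
    for units in blocks.values():
        # Iterate: all pairs within the block
        for tx, ty in combinations(units, 2):
            # Detect: the right-hand sides differ
            if tx[rhs] != ty[rhs]:
                # GenFix: the two cells should be equal
                fixes.append((f"{tx['id']}.{rhs}", "=", f"{ty['id']}.{rhs}"))
    return fixes

fixes = run_fd_plan(rows)
print(fixes)
```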

Page 38: BigDansing presentation slides for KAUST

38

Logical Planning – DC example

● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)

● There cannot exist two tuples in relation D where t1's salary is greater than t2's salary and t1's rate is less than t2's rate.

● Operators:

– Scope(Salary,Rate)

– Detect(tx.Salary > ty.Salary AND tx.Rate < ty.Rate)

– GenFix(tx.Salary <= ty.Salary OR tx.Rate >= ty.Rate)

Dataset → Scope → Detect → GenFix

Page 39: BigDansing presentation slides for KAUST

39

Logical Planning – UDF only example

● Dataset: Temperature sensors dataset

● Rule: No tuple in dataset D may have a value that differs from the average by more than 5º.

Sensor ID Room Temp

t1 1 Bedroom 36.6º

t2 2 Roof 40º

t3 3 Bedroom 35.2º

t4 4 Bedroom 43.1º

t5 5 Bedroom 33.5º

Page 40: BigDansing presentation slides for KAUST

40

Logical Planning – UDF only example

● Dataset: Temperature sensors dataset

● Rule: No tuple in dataset D may have a value that differs from the average by more than 5º.

● Operators:

– Scope(Room,Temp)

– Block(Room)

– Iterate(Average the list,tx)

– Detect(tx.temp < avg-c OR tx.temp > avg+c)

– GenFix(tx.temp >= avg-c AND tx.temp <= avg+c)

Sensor ID Room Temp

t1 1 Bedroom 36.6º

t2 2 Roof 40º

t3 3 Bedroom 35.2º

t4 4 Bedroom 43.1º

t5 5 Bedroom 33.5º

Dataset → Scope → Block → Iterate → Detect → GenFix
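
A Python sketch of this UDF-style rule on the sensor table, assuming the average is taken per room (as Block(Room) suggests) and c = 5. The function name `detect_outliers` is an assumption.

```python
from collections import defaultdict
from statistics import mean

# (tuple id, sensor id, room, temperature) from the slide
readings = [
    ("t1", 1, "Bedroom", 36.6),
    ("t2", 2, "Roof", 40.0),
    ("t3", 3, "Bedroom", 35.2),
    ("t4", 4, "Bedroom", 43.1),
    ("t5", 5, "Bedroom", 33.5),
]

def detect_outliers(readings, c=5.0):
    """Return ids of readings more than c degrees away from their room average."""
    by_room = defaultdict(list)           # Block(Room)
    for tid, _sensor, room, temp in readings:
        by_room[room].append((tid, temp))
    outliers = []
    for units in by_room.values():
        avg = mean(t for _, t in units)   # Iterate: average the list
        outliers += [tid for tid, t in units if t < avg - c or t > avg + c]
    return outliers

outliers = detect_outliers(readings)
print(outliers)  # the Bedroom average is about 37.1, so only 43.1 is flagged
```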

Page 42: BigDansing presentation slides for KAUST

42

Logical Plans – Bushy plan

● C1, C2 and C3 are denial constraints from the ICDE 2013 paper "Holistic Data Cleaning: Putting Violations Into Context".

Page 43: BigDansing presentation slides for KAUST

43

Physical Plans

Page 44: BigDansing presentation slides for KAUST

44

Physical Plans

● Physical operators are system specific

– MPI, Hadoop, Spark

● Each physical operator is an independent execution unit.

● Each logical operator → one physical operator.

● BigDansing consolidates logical plans to improve I/O.

● More physical operators can be added with different optimizations to improve logical plans.

Page 45: BigDansing presentation slides for KAUST

45

Physical Plans - Plan consolidation

● Plan consolidation is a static logical plan optimization.

● BigDansing consolidates two similar logical operators if they share the same input.

Page 47: BigDansing presentation slides for KAUST

47

Physical Plans - Physical translation

● FD: Zipcode → City

Logical plan: Dataset → Scope → Block → Iterate → Detect → GenFix

Physical plan: Dataset → PScope → PBlock → PIterate → PDetect → PGenFix

Physical plan: Dataset → PScope → PBlock → (PIterate → PDetect → PGenFix)

Page 50: BigDansing presentation slides for KAUST

50

Physical Plans – Physical translation

● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)

● There cannot exist two tuples where t1's salary is greater than t2's salary and t1's rate is less than t2's rate.

Logical plan: Dataset → Scope → Detect → GenFix

Alternative physical plans:

– Dataset → PScope → CrossProduct → (PDetect → PGenFix)

– Dataset → PScope → UCrossProduct → (PDetect → PGenFix)

– Dataset → PScope → OCJoin (distributed sort merge join) → (PDetect → PGenFix)

Page 51: BigDansing presentation slides for KAUST

51

Experiments – OCJoin vs. Others

● TaxB dataset:

– DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)

● 16 workers

Page 52: BigDansing presentation slides for KAUST

52

OCJoin Physical Operator

● A self join on one or more ordering comparisons:

– (<, >, ≥, ≤)

● Reduces the complexity of the cross product by reducing the search space.

● Steps:

– Partitioning into blocks

– Sorting the blocks

– Pruning

– Joining
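
The four steps can be illustrated with a single-machine Python sketch. This is not BigDansing's OCJoin implementation: the block size, the min/max pruning test and the names `blocked_join` and `naive_join` are assumptions chosen to show how sorting and pruning shrink the search space before the join.

```python
from itertools import product

# (id, Salary, Rate) from the employee table
rows = [
    ("t1", 24000, 15), ("t2", 25000, 10), ("t3", 40000, 25), ("t4", 88000, 28),
    ("t5", 15000, 15), ("t6", 81000, 28), ("t7", 40000, 25),
]

def naive_join(rows):
    """Full cross product: pairs (x, y) with x.Salary > y.Salary and x.Rate < y.Rate."""
    return [(x[0], y[0]) for x, y in product(rows, rows)
            if x[1] > y[1] and x[2] < y[2]]

def blocked_join(rows, block_size=2):
    # Partition and sort: order by Salary, then cut into fixed-size blocks
    ordered = sorted(rows, key=lambda r: r[1])
    blocks = [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]
    result = []
    for a, b in product(blocks, blocks):
        # Prune: block a can hold the left tuple only if some salary in a can
        # exceed a salary in b, and some rate in a can be below a rate in b
        if not (max(r[1] for r in a) > min(r[1] for r in b)
                and min(r[2] for r in a) < max(r[2] for r in b)):
            continue
        # Join: compare only the tuples of the surviving block pair
        result += [(x[0], y[0]) for x, y in product(a, b)
                   if x[1] > y[1] and x[2] < y[2]]
    return result

print(sorted(blocked_join(rows)))
```

The pruning test is safe because any qualifying pair forces both min/max conditions to hold for its two blocks, so no result is lost.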

Page 53: BigDansing presentation slides for KAUST

53

Repair Algorithms

Page 54: BigDansing presentation slides for KAUST

54

Repair Algorithms – Basics

● BigDansing supports most serial repair algorithms.

● BigDansing exploits the nature of violations:

– Different violations are independent.

● The repair is parallelized by running different instances of the repair algorithm on independent violations.

● We implement two serial repair algorithms to run in distributed mode:

– Equivalence class algorithm

– Hypergraph algorithm

Page 55: BigDansing presentation slides for KAUST

55

Repair Algorithms – Steps

● Connected components → identify independent fixes.

● Each connected component → instance of repair algorithm.
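
A Python sketch of the first step, using a small union-find over possible fixes of the kind shown in this deck. The node naming and the function `connected_components` are assumptions; each resulting component could then be handed to its own repair instance.

```python
# Possible fixes as edges between cells (FD fixes) or tuples (DC fixes)
edges = [
    ("t2.City", "t4.City"),
    ("t4.City", "t6.City"),
    ("t3.State", "t7.State"),
    ("t2.Salary/Rate", "t1.Salary/Rate"),
    ("t2.Salary/Rate", "t5.Salary/Rate"),
]

def connected_components(edges):
    """Group nodes that are linked, directly or transitively, by an edge."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    components = {}
    for node in parent:
        components.setdefault(find(node), set()).add(node)
    return sorted(sorted(c) for c in components.values())

components = connected_components(edges)
for component in components:
    print(component)
```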

Page 56: BigDansing presentation slides for KAUST

56

Equivalence Class Algorithm

● Fix errors based on (=,≠)

● Based on heuristics:

– Partition the possible fixes into different groups

– Assign the highest-frequency value to each group

● Example:

– Group 1: Zipcode = 60601

● Highest frequency = CH

– Group 2: Zipcode = 90210

● Highest frequency = LA

Name Zipcode City

t1 Annie 60601 NY

t2 Laure 90210 LA

t3 John 60601 CH

t4 Mark 90210 SF

t5 Robert 60601 CH

t6 Mary 90210 LA

t7 Jon 60601 CH
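
A Python sketch of the majority-value heuristic on the table above. It is a simplification of the equivalence class algorithm for illustration; `repair_by_majority` is a hypothetical name.

```python
from collections import Counter, defaultdict

# (id, name, zipcode, city) from the slide
rows = [
    ("t1", "Annie", "60601", "NY"), ("t2", "Laure", "90210", "LA"),
    ("t3", "John", "60601", "CH"), ("t4", "Mark", "90210", "SF"),
    ("t5", "Robert", "60601", "CH"), ("t6", "Mary", "90210", "LA"),
    ("t7", "Jon", "60601", "CH"),
]

def repair_by_majority(rows):
    """Assign each zipcode group its most frequent city."""
    cities = defaultdict(list)
    for _tid, _name, zipcode, city in rows:
        cities[zipcode].append(city)
    majority = {z: Counter(c).most_common(1)[0][0] for z, c in cities.items()}
    return [(tid, name, z, majority[z]) for tid, name, z, _city in rows]

repaired = repair_by_majority(rows)
changed = {new[0]: new[3] for new, old in zip(repaired, rows) if new[3] != old[3]}
print(changed)  # t1 becomes CH, t4 becomes LA
```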

Page 57: BigDansing presentation slides for KAUST

57

Hyper-Graph algorithm

● Fix errors based on (<,>,≤, and ≥).

● Based on linear optimization and greedy MVC:

– Select the hyper-graph node with the most edges

– Change its value depending on edge conditions

[Diagram: hyper-graph with one node per tuple (ti.Salary, ti.Rate) for t1 – t7; each edge is labeled >,< and links two tuples that violate the DC]

Name Salary Rate

t1 Annie 24000 15

t2 Laure 25000 10

t3 John 40000 25

t4 Mark 88000 24

t5 Robert 15000 15

t6 Mary 81000 28

t7 Jon 40000 25
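
A Python sketch of the greedy node selection only. Checking the DC (Salary >, Rate <) against the table on this slide gives five violating pairs, used as edges below. Treating whole tuples as nodes and breaking ties by name are simplifying assumptions, and the value-changing step is omitted.

```python
from collections import Counter

# Violating pairs derived from the table on this slide (t4 has Rate 24 here)
edges = [("t2", "t1"), ("t2", "t5"), ("t4", "t3"), ("t4", "t6"), ("t4", "t7")]

def greedy_vertex_cover(edges):
    """Repeatedly pick the node with the most uncovered edges."""
    remaining = list(edges)
    cover = []
    while remaining:
        degree = Counter(node for edge in remaining for node in edge)
        # sorted() makes ties deterministic: the smallest name wins
        best = max(sorted(degree), key=degree.__getitem__)
        cover.append(best)
        remaining = [e for e in remaining if best not in e]
    return cover

cover = greedy_vertex_cover(edges)
print(cover)  # t4 covers three edges, then t2 covers the remaining two
```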

Page 58: BigDansing presentation slides for KAUST

58

Repair algorithms – Possible fixes

Name Zipcode City State Salary Rate

t1 Annie 10001 NY NY 24000 15

t2 Laure 90210 LA CA 25000 10

t3 John 60601 CH IL 40000 25

t4 Mark 90210 SF CA 88000 28

t5 Robert 60827 CH IL 15000 15

t6 Mary 90210 LA CA 81000 28

t7 Jon 60601 CH NY 40000 25

● FD: Zipcode → City:

● t2.City = t4.City

● t4.City = t6.City

[Diagram: nodes t2.City, t4.City and t6.City connected by = edges]

Page 59: BigDansing presentation slides for KAUST

59

Repair algorithms – Possible fixes

Name Zipcode City State Salary Rate

t1 Annie 10001 NY NY 24000 15

t2 Laure 90210 LA CA 25000 10

t3 John 60601 CH IL 40000 25

t4 Mark 90210 SF CA 88000 28

t5 Robert 60827 CH IL 15000 15

t6 Mary 90210 LA CA 81000 28

t7 Jon 60601 CH NY 40000 25

● FD: Zipcode → State:

● t3.State = t7.State

[Diagram: nodes t3.State and t7.State connected by an = edge]

Page 60: BigDansing presentation slides for KAUST

60

Repair algorithms – Possible fixes

Name Zipcode City State Salary Rate

t1 Annie 10001 NY NY 24000 15

t2 Laure 90210 LA CA 25000 10

t3 John 60601 CH IL 40000 25

t4 Mark 90210 SF CA 88000 28

t5 Robert 60827 CH IL 15000 15

t6 Mary 90210 LA CA 81000 28

t7 Jon 60601 CH NY 40000 25

● DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate):

● t2.Salary > t1.Salary, t2.Rate < t1.Rate

● t2.Salary > t5.Salary, t2.Rate < t5.Rate

[Diagram: node (t2.Salary, t2.Rate) connected by >,< edges to nodes (t1.Salary, t1.Rate) and (t5.Salary, t5.Rate)]

Page 61: BigDansing presentation slides for KAUST

61

Repair algorithms – Connected components

[Diagram: the possible fixes of all rules form one graph with three connected components: {t2.City, t4.City, t6.City} linked by = edges; {t3.State, t7.State} linked by an = edge; and the (Salary, Rate) nodes of t1, t2 and t5 linked by >,< edges]

Page 62: BigDansing presentation slides for KAUST

62

Repair algorithms – Distributed repair

● Different violations require different repair algorithms:

– {t2.City, t4.City, t6.City}, = edges: Equivalence class algorithm

– {t3.State, t7.State}, = edge: Equivalence class algorithm

– (Salary, Rate) nodes of t1, t2 and t5, >,< edges: Hyper-graph algorithm

Page 63: BigDansing presentation slides for KAUST

63

Use Case: RDF example

Page 64: BigDansing presentation slides for KAUST

64

Use Case: RDF example

● Two graduate students at two different universities cannot have the same professor as advisor.

Page 65: BigDansing presentation slides for KAUST

65

Use Case: RDF Example - Input

Page 66: BigDansing presentation slides for KAUST

66

Use Case: RDF Example - Scope

RDF → Scope

Page 67: BigDansing presentation slides for KAUST

67

Use Case: RDF Example - Block

RDF → Scope → Block

Page 68: BigDansing presentation slides for KAUST

68

Use Case: RDF Example - Iterate

RDF → Scope → Block → Iterate

Page 69: BigDansing presentation slides for KAUST

69

Use Case: RDF Example - Block

RDF → Scope → Block → Iterate → Block

Page 70: BigDansing presentation slides for KAUST

70

Use Case: RDF Example - Iterate

RDF → Scope → Block → Iterate → Block → Iterate

Page 71: BigDansing presentation slides for KAUST

71

Use Case: RDF Example – Detect, GenFix

RDF → Scope → Block → Iterate → Block → Iterate → Detect → GenFix

Page 72: BigDansing presentation slides for KAUST

72

Use Case: RDF Example – Physical Plan

Logical plan: RDF → Scope → Block → Iterate → Block → Iterate → Detect → GenFix

Physical plan: RDF → PScope → PBlock → PIterate → PBlock → (PIterate → PDetect → PGenFix)

Page 73: BigDansing presentation slides for KAUST

73

Experiments

Page 74: BigDansing presentation slides for KAUST

74

Datasets

Dataset    Type                              Size           Error type
TaxA       Synthetic, based on real dataset  100K -- 40M    Typos
TaxB       Synthetic                         100K -- 3M     Numerical errors
TPCH       Synthetic                         100K -- 1.9B   Typos
Customer1  Real                              19M            Duplicates
Customer2  Real                              32M            Duplicates
NCVoters   Real                              9M             Duplicates
HAI        Real                              166K           Typos

Page 75: BigDansing presentation slides for KAUST

75

Systems

● NADEEF: a data cleansing system on a single machine

● PostgreSQL: a database management system

● Shark: distributed SQL engine based on Hive and Spark

● Spark SQL: distributed SQL engine based on Spark

● BigDansing, BigDansing-Spark

● BigDansing-Hadoop

Page 76: BigDansing presentation slides for KAUST

76

Infrastructure and Systems

● Single machine:

– Dell Precision T7500 with two 64-bit quad-core Intel Xeon X5550, and 58GB RAM

● Cluster:

– 17 Shuttle SH55J2 machines (1 master with 16 workers) equipped with Intel i5 processors with 16GB RAM

Page 77: BigDansing presentation slides for KAUST

77

Experiments – Serial FD

● TaxA dataset:

● FD: Zipcode → City

● FD: Zipcode → State

Page 78: BigDansing presentation slides for KAUST

78

Experiments – Serial DC

● TaxB dataset:

– DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)

● OCJoin optimization

Page 79: BigDansing presentation slides for KAUST

79

Experiments – Parallel FD

● TPCH dataset:

● FD: custkey → custAddress

● 16 Workers

Page 80: BigDansing presentation slides for KAUST

80

Experiments – Scalability

● TPCH Dataset:

● FD: custkey → custAddress

● Dataset: 500M rows

Page 81: BigDansing presentation slides for KAUST

81

Points to Remember

● We present BigDansing as a distributed system for data cleansing.

● Easy to use, no need for parallel development experience.

● Faster than all related work.

● Abstraction is independent of distributed system environment.

● Supports different physical optimizations for a single logical plan.

● Scales to 1.9B rows; related work handles only 1M rows.

● Natively supports repair algorithms without modifications.

Page 82: BigDansing presentation slides for KAUST

82

Questions?

Page 83: BigDansing presentation slides for KAUST

83

Experiments – Parallel DC

● TaxB Dataset

– DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ˄ t1.Rate < t2.Rate)

● 16 workers

Page 84: BigDansing presentation slides for KAUST

84

Repair Quality