Scaling Big Data Cleansing PHD DEFENSE OF : ZUHAIR KHAYYAT MAY, 2017


Page 1: Scaling Big Data Cleansing

Scaling Big Data Cleansing

PHD DEFENSE OF: ZUHAIR KHAYYAT

MAY, 2017

Page 2: Scaling Big Data Cleansing

What is Data Cleansing?

• Data cleansing is the process of:
  A. detecting errors in record sets, tables, or databases (violation detection)
  B. and fixing them (violation repair)

• Example errors in data:
  - Typos
  - Duplicates
  - Values inconsistent with business rules
  - Outliers
  - Outdated values
  - Missing values

May 16, 2017 2/73

Page 3: Scaling Big Data Cleansing

Why is Data Cleansing Important?

• 25% of the world's critical data is dirty
• 60% to 98% of a data scientist's time is lost in the process of data cleansing
• "Duplicate and dirty data costs the healthcare industry over $300 billion every year" -- Joe Fusaro (RingLead)
• "Inaccurate data has a direct impact ... the average company losing 12% of its revenue" -- Ben Davis (Econsultancy)

Page 4: Scaling Big Data Cleansing

Example of a Dirty Dataset

A company employee database:

• Rule 1: Any two employees in the same Zipcode must be in the same City
• Rule 2: An employee who earns a higher salary must pay more taxes compared to others

     Name    Zipcode  City  State  Salary  Rate
t1   Annie   10001    NY    NY     24000   15
t2   Laure   90210    LA    CA     25000   10
t3   John    60601    CH    IL     40000   25
t4   Mark    90210    SF    CA     88000   28
t5   Robert  60827    CH    IL     15000   15
t6   Mary    90210    LA    CA     81000   28

Page 5: Scaling Big Data Cleansing

The Process of Data Cleansing

[Figure: the data cleansing cycle. Detection rules and the dirty input data feed four steps: 1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update input data. The cycle repeats while the data is still dirty and ends with clean data. Every stage shows the six-row employee table (t1 to t6).]

Page 6: Scaling Big Data Cleansing

Why is Dirty Data Still a Problem?

• Data is growing at a 40% compound annual rate
• Source: Oracle, 2012, https://goo.gl/uHd4uR

[Figure: data growth chart, annotated "≈15 Zettabytes"]

Page 7: Scaling Big Data Cleansing

Problems of Big Data Cleansing

[Figure: the data cleansing cycle (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update input data), annotated with the labels "≈90% Runtime" and "Most of Research".]

[Figure: bar chart of time (seconds, 0 to 100) against violation percentage (1%, 5%, 10%, 50%), with two series: violation detection and data repair.]

Page 8: Scaling Big Data Cleansing

Problems of Big Data Cleansing

[Figure: the data cleansing cycle (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update input data).]

1. Violation detection becomes too expensive with big data:
   a. Enumerating all tuples is not possible
   b. Not feasible to implement a parallel version of each detection rule
   c. Serial repair algorithms cannot handle big errors

Page 9: Scaling Big Data Cleansing

Problems of Big Data Cleansing

[Figure: the data cleansing cycle (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update input data).]

2. Complex error discovery rules based on inequality conditions are too expensive:

Rule 2: An employee who earns a higher salary must pay more taxes compared to others

⇒ $(t_i.\text{salary} < t_j.\text{salary}) \land (t_i.\text{tax} > t_j.\text{tax})$

Page 10: Scaling Big Data Cleansing

Problems of Big Data Cleansing

[Figure: the data cleansing cycle (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update input data).]

3. Error graph (violation graph) is random, big and unpredictable:
   • Irregular structures
   • Skewed distributions
   • Unpredictable workload of algorithm

Page 11: Scaling Big Data Cleansing

Problems & Solutions of Big Data Cleansing

Problems
1. Violation detection becomes too expensive with big data
2. Complex error discovery rules based on inequality conditions are too expensive
3. Error graph (violation graph) is random, big and unpredictable

Solutions
• Develop a general-purpose scalable data cleansing system: BigDansing
• Introduce a new join algorithm to enhance inequality joins: IEJoin
• Build a general graph system that adapts to various graph structures and algorithms: Mizan

Page 12: Scaling Big Data Cleansing

BigDansing

A System for Big Data Cleansing

Page 13: Scaling Big Data Cleansing

Related work?

NADEEF*

[Figure: NADEEF stack with labels: Declarative Rules, UDF, DBMS]

✓ Easy-to-use
✓ Extensible
✓ Efficient
☓ Scalable (single machine)

* M. Dallachiesa, et al., "NADEEF: A Commodity Data Cleaning System," in SIGMOD 2013

Page 14: Scaling Big Data Cleansing

What does Big Data Cleansing require?

1. Scale Detection
   - Declarative rules
     - Functional dependencies (FDs, CFDs)
     - Denial constraints (DCs)
   - User-defined functions
2. Scale Repairs
   - Handle serial repair algorithms

Page 15: Scaling Big Data Cleansing

BigDansing – Scaling Violation Detection

[Figure: four rule types (functional dependencies, denial constraints, entity resolution, inclusion dependencies) sit on top of a domain-specific language with five operators: Scope, Block, Iterate, Detect, GenFix]

Page 16: Scaling Big Data Cleansing

BigDansing – Input

[Figure: the input is either UDFs (Scope, Block, Iterate, Detect, GenFix) or declarative rules that go through a rule parser; both yield a violation detection plan (logical plan)]

Page 17: Scaling Big Data Cleansing

BigDansing – Plan Conversion and Optimization

Logical Plan → Physical Plan → Execution Plan

Page 18: Scaling Big Data Cleansing

Rule 1 – Logical Plan

- Any two employees in the same Zipcode must be in the same City
- FD: Zipcode → City

Logical operators: Scope, Block, Iterate, Detect, GenFix

Logical plan: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix

Page 19: Scaling Big Data Cleansing

Rule 1 – Physical Plan

- Any two employees in the same Zipcode must be in the same City
- FD: Zipcode → City

Physical operators: PScope, PBlock, PIterate, PDetect, PGenFix, CoBlock, UCrossProduct

Logical plan:  Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix
Physical plan: PScope → PBlock → PIterate → PDetect → PGenFix

Page 20: Scaling Big Data Cleansing

Rule 1 – Execution Plan

- Any two employees in the same Zipcode must be in the same City
- FD: Zipcode → City

Logical plan:   Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix
Physical plan:  PScope → PBlock → PIterate → PDetect → PGenFix
Execution plan: Spark-PScope → Spark-PBlock → Spark-PIterate → Spark-PDetect → Spark-PGenFix

Page 21: Scaling Big Data Cleansing

Rule 1 – Execution Example

Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix

Input:

     Name    Zipcode  City  State  Salary  Rate
t1   Annie   10001    NY    NY     24000   15
t2   Laure   90210    LA    CA     25000   10
t3   John    60601    CH    IL     40000   25
t4   Mark    90210    SF    CA     88000   28
t5   Robert  60827    CH    IL     15000   15
t6   Mary    90210    LA    CA     81000   28

1) Scope:

     Zipcode  City
t1   10001    NY
t2   90210    LA
t3   60601    CH
t4   90210    SF
t5   60827    CH
t6   90210    LA

2) Block:

     Zipcode  City
t1   10001    NY

t2   90210    LA
t4   90210    SF
t6   90210    LA

t3   60601    CH

t5   60827    CH

3) Iterate: (t2, t4), (t2, t6), (t4, t6)

4) Detect: (t2, t4), (t4, t6)

5) GenFix: t2[City] = t4[City], t4[City] = t6[City]
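The five logical operators of the Rule 1 example can be sketched as plain Python functions. This is a minimal single-machine illustration, not BigDansing's actual API: the function names, the dictionary layout of the rows, and the string form of the fixes are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations

# Employee table from the slides (only the columns needed here).
EMPLOYEES = {
    "t1": {"Name": "Annie", "Zipcode": "10001", "City": "NY"},
    "t2": {"Name": "Laure", "Zipcode": "90210", "City": "LA"},
    "t3": {"Name": "John", "Zipcode": "60601", "City": "CH"},
    "t4": {"Name": "Mark", "Zipcode": "90210", "City": "SF"},
    "t5": {"Name": "Robert", "Zipcode": "60827", "City": "CH"},
    "t6": {"Name": "Mary", "Zipcode": "90210", "City": "LA"},
}


def scope(rows):
    """Scope: keep only the attributes the rule needs (Zipcode, City)."""
    return {tid: (r["Zipcode"], r["City"]) for tid, r in rows.items()}


def block(scoped):
    """Block: group tuples by Zipcode so only same-zipcode tuples meet."""
    blocks = defaultdict(list)
    for tid, (zipcode, city) in scoped.items():
        blocks[zipcode].append((tid, city))
    return blocks


def iterate(blocks):
    """Iterate: enumerate candidate pairs inside each block."""
    for members in blocks.values():
        yield from combinations(members, 2)


def detect(pairs):
    """Detect: a pair violates the FD when the cities differ."""
    return [(a, b) for (a, city_a), (b, city_b) in pairs if city_a != city_b]


def gen_fix(violations):
    """GenFix: propose making the two City values equal."""
    return [f"{a}[City] = {b}[City]" for a, b in violations]


candidates = list(iterate(block(scope(EMPLOYEES))))
violations = detect(candidates)
fixes = gen_fix(violations)
```

Blocking is what keeps the work small here: only the three tuples that share Zipcode 90210 are ever paired, so three candidate pairs are checked instead of all fifteen.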

Page 22: Scaling Big Data Cleansing

Rule 2 – Logical Plan

- An employee who earns a higher salary must pay more taxes compared to others
- DC: $\forall t_i, t_j \in D,\ \neg(t_i.\text{Salary} < t_j.\text{Salary} \land t_i.\text{Rate} > t_j.\text{Rate})$

     Name    Zipcode  City  State  Salary  Rate
t1   Annie   10001    NY    NY     24000   15
t2   Laure   90210    LA    CA     25000   10
t3   John    60601    CH    IL     40000   25
t4   Mark    90210    SF    CA     88000   28
t5   Robert  60827    CH    IL     15000   15
t6   Mary    90210    LA    CA     81000   28

• For Annie, compare Salary with:
  - Laure: compare Rate, report a violation!
  - John: compare Rate
  - Mark: compare Rate
  - Robert
  - Mary: compare Rate

Page 23: Scaling Big Data Cleansing

Rule 2 – Logical Plan

- An employee who earns a higher salary must pay more taxes compared to others
- DC: $\forall t_i, t_j \in D,\ \neg(t_i.\text{Salary} < t_j.\text{Salary} \land t_i.\text{Rate} > t_j.\text{Rate})$

Logical operators: Scope, Block, Iterate, Detect, GenFix

Logical plan: Scope(Salary, Rate) → Iterate → Detect(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → GenFix

Page 24: Scaling Big Data Cleansing

Rule 2 – Physical Plan

- An employee who earns a higher salary must pay more taxes compared to others
- DC: $\forall t_i, t_j \in D,\ \neg(t_i.\text{Salary} < t_j.\text{Salary} \land t_i.\text{Rate} > t_j.\text{Rate})$

Physical operators: PScope, PBlock, PIterate, PDetect, PGenFix, CoBlock, UCrossProduct

Logical plan:  Scope(Salary, Rate) → Iterate → Detect(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → GenFix
Physical plan: PScope → UCrossProduct → PDetect → PGenFix

Page 25: Scaling Big Data Cleansing

Rule 2 – Execution Plan

- An employee who earns a higher salary must pay more taxes compared to others
- DC: $\forall t_i, t_j \in D,\ \neg(t_i.\text{Salary} < t_j.\text{Salary} \land t_i.\text{Rate} > t_j.\text{Rate})$

Logical plan:   Scope(Salary, Rate) → Iterate → Detect(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → GenFix
Physical plan:  PScope → UCrossProduct → PDetect → PGenFix
Execution plan: Spark-PScope → Spark-UCrossProduct → Spark-PDetect → Spark-PGenFix

Page 26: Scaling Big Data Cleansing

Plan Optimizations – OCJoin

- DC: $\forall t_i, t_j \in D,\ \neg(t_i.\text{Salary} < t_j.\text{Salary} \land t_i.\text{Rate} > t_j.\text{Rate})$

[Figure: OCJoin pipeline with four stages: range partitioning, sorting, pruning, joining. The data is range-partitioned (Partition 1 to Partition n) based on Salary and based on Rate. After pruning, only the surviving partition pairs are joined; the example shows Partition 2 ⨝ Partition 3 and Partition 5 ⨝ Partition 6.]
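The partition, prune, join idea can be illustrated with a small single-machine sketch. It is not BigDansing's OCJoin implementation: here the rows are range-partitioned on Salary only, and a partition pair is pruned by comparing per-partition minimum and maximum values. The function name `oc_join` and the parameter `n_parts` are assumptions made for this example.

```python
# (Salary, Rate) of the employee table from the slides.
EMPLOYEES = {
    "t1": (24000, 15), "t2": (25000, 10), "t3": (40000, 25),
    "t4": (88000, 28), "t5": (15000, 15), "t6": (81000, 28),
}


def oc_join(rows, n_parts=3):
    """Return (pairs, joined): pairs (a, b) with a.Salary < b.Salary and
    a.Rate > b.Rate, and the number of partition pairs actually joined."""
    # Range partitioning: sort on Salary and cut into contiguous chunks.
    ordered = sorted(rows.items(), key=lambda kv: kv[1][0])
    size = -(-len(ordered) // n_parts)  # ceiling division
    parts = [ordered[i:i + size] for i in range(0, len(ordered), size)]

    result, joined = [], 0
    for p in parts:
        for q in parts:
            # Pruning: (p, q) can only produce a match if some Salary in p
            # is below some Salary in q and some Rate in p is above some
            # Rate in q.
            can_match = (min(v[0] for _, v in p) < max(v[0] for _, v in q)
                         and max(v[1] for _, v in p) > min(v[1] for _, v in q))
            if not can_match:
                continue
            joined += 1
            # Joining: nested loop inside the surviving partition pair.
            for a, (sal_a, rate_a) in p:
                for b, (sal_b, rate_b) in q:
                    if sal_a < sal_b and rate_a > rate_b:
                        result.append((a, b))
    return result, joined


pairs, joined = oc_join(EMPLOYEES)
```

On the six-row table, only two of the nine partition pairs survive pruning, and they yield the two Rule 2 violations (t1, t2) and (t5, t2).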

Page 27: Scaling Big Data Cleansing

Rule 2 – Execution Plan

- Rule 2: An employee who earns a higher salary must pay more taxes compared to others
- DC: $\forall t_i, t_j \in D,\ \neg(t_i.\text{Salary} < t_j.\text{Salary} \land t_i.\text{Rate} > t_j.\text{Rate})$

Logical plan:   Scope(Salary, Rate) → Iterate → Detect(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → GenFix
Physical plan:  PScope → OCJoin(ti.Salary < tj.Salary ∧ ti.Rate > tj.Rate) → PDetect → PGenFix
Execution plan: Spark-PScope → Spark-OCJoin → Spark-PDetect → Spark-PGenFix

Page 28: Scaling Big Data Cleansing

What does Big Data Cleansing require?

1. Scale Detection
   - Declarative rules
     - Functional dependencies (FDs, CFDs)
     - Denial constraints (DCs)
   - User-defined functions
2. Scale Repairs
   - Handle serial repair algorithms

Page 29: Scaling Big Data Cleansing

BigDansing – Structure of the Violation Graph

Rule 1: Zipcode → City
Rule 2: $\forall t_1, t_2 \in D,\ \neg(t_1.\text{Salary} < t_2.\text{Salary} \land t_1.\text{Rate} > t_2.\text{Rate})$

     Name    Zipcode  City  State  Salary  Rate
t1   Annie   10001    NY    NY     24000   15
t2   Laure   90210    LA    CA     25000   10
t3   John    60601    CH    IL     40000   25
t4   Mark    90210    SF    CA     88000   28
t5   Robert  60827    CH    IL     15000   15
t6   Mary    90210    LA    CA     81000   28

• Rule 1: t2[City] = t4[City]
• Rule 2: t1[Salary] > t2[Salary] OR t1[Tax] < t2[Tax]

Page 30: Scaling Big Data Cleansing

BigDansing – Structure of the Violation Graph

Rule 1: Zipcode → City
Rule 2: $\forall t_1, t_2 \in D,\ \neg(t_1.\text{Salary} < t_2.\text{Salary} \land t_1.\text{Rate} > t_2.\text{Rate})$

[Figure: violation graph over the nodes t1, t5, t2, t4 and t6, with edge labels "R1: City" (twice) and "R2: Salary, Tax"]

• Rule 1: t2[City] = t4[City]
• Rule 1: t4[City] = t6[City]
• Rule 2: t1[Salary] > t2[Salary] OR t1[Tax] < t2[Tax]
• Rule 2: t5[Salary] > t2[Salary] OR t5[Tax] < t2[Tax]

Page 31: Scaling Big Data Cleansing

BigDansing – Data Repair as a Black Box

[Figure: graph analysis splits the violation graph into subgraphs (t1, t5, t2 under "R2: Salary, Tax"; t2, t4, t6 under "R1: City"; further pairs tx, ty under "R1: City"). Each subgraph is handed to its own instance of a serial repair algorithm.]

Page 32: Scaling Big Data Cleansing

BigDansing – Apache Spark Stack

Page 33: Scaling Big Data Cleansing

Performance of a Single Machine

[Figure: two bar charts of runtime (seconds) against dataset size (rows), labeled Rule 1 and Rule 2. One chart covers 100,000 to 10,000,000 rows, the other 100,000 to 300,000 rows. Series: BigDansing, NADEEF, PostgreSQL, Spark SQL, Shark.]

Page 34: Scaling Big Data Cleansing

Performance on a 16-machine cluster

[Figure: two bar charts of time (seconds) against dataset size (rows), labeled Rule 1 and Rule 2. One chart covers 10M to 40M rows with series BigDansing-Spark, BigDansing-Hadoop, Spark SQL and Shark; the other covers 1M to 3M rows with series BigDansing-Spark, Spark SQL and Shark.]

Page 35: Scaling Big Data Cleansing

Performance on a 16-machine cluster

[Figure: two bar charts. The first plots runtime (seconds) against the number of workers (1, 2, 4, 8, 16) for BigDansing and Spark SQL. The second plots time (seconds) against dataset size (647M, 959M, 1271M, 1583M, 1907M rows) for BigDansing-Spark, BigDansing-Hadoop and Spark SQL.]

Page 36: Scaling Big Data Cleansing

Detecting Violations on RDF

[Figure: logical plan in which Scope feeds three Block/Iterate pairs (Block1 and Iterate1, Block2 and Iterate2, Block3 and Iterate3), followed by Detect and GenFix]

Page 37: Scaling Big Data Cleansing

Detecting Violations on RDF

[Figure: bar chart of runtime (seconds, 0 to 5000) against the number of RDF triples (21M, 42M, 85M, 170M), comparing BigDansing with S2RDF*. Each bar is split into pre-processing and violation detection.]

* Alexander Schätzle, et al., "S2RDF: RDF Querying with SPARQL on Spark", in PVLDB 2016

Page 38: Scaling Big Data Cleansing

BigDansing: A System for Big Data Cleansing

✓ Easy-to-use
✓ Efficient
✓ Extensible
✓ Scalable

* Zuhair Khayyat, et al., "BigDansing: A System for Big Data Cleansing", in SIGMOD 2015.

Page 39: Scaling Big Data Cleansing

IEJoin

Fast and Scalable Inequality Joins

Page 40: Scaling Big Data Cleansing

OCJoin in BigDansing

[Figure: two bar charts. The first plots runtime (seconds) against data size (100,000 to 300,000 rows) for OCJoin, UCrossProduct and Cross product. The second plots time (seconds) against dataset size (1M to 3M rows) for BigDansing-Spark, Spark SQL and Shark.]

Page 41: Scaling Big Data Cleansing

What is the Problem?

• Rule 2: $\forall t_1, t_2 \in D,\ \neg(t_1.\text{Salary} < t_2.\text{Salary} \land t_1.\text{Rate} > t_2.\text{Rate})$
  - SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Tax > t2.Tax
• Processed as a Cartesian product: $O(n^2)$
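The quadratic cost is easy to see in a naive evaluation of Rule 2. The sketch below is only a baseline illustration on the six-row employee table; the function name `naive_rule2` is an assumption made for this example.

```python
from itertools import permutations

# (Salary, Rate) of the employee table from the slides.
EMPLOYEES = {
    "t1": (24000, 15), "t2": (25000, 10), "t3": (40000, 25),
    "t4": (88000, 28), "t5": (15000, 15), "t6": (81000, 28),
}


def naive_rule2(rows):
    """Check every ordered pair of tuples: n * (n - 1) comparisons."""
    violations, comparisons = [], 0
    for (a, (sal_a, rate_a)), (b, (sal_b, rate_b)) in permutations(rows.items(), 2):
        comparisons += 1
        if sal_a < sal_b and rate_a > rate_b:
            violations.append((a, b))
    return violations, comparisons


violations, comparisons = naive_rule2(EMPLOYEES)
```

Six tuples already need 30 comparisons to find two violations; at one million tuples the same loop would need roughly 10^12 comparisons.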

Page 42: Scaling Big Data Cleansing

Related Work

• Band join:
  - Based on a point within a range: R.A − c1 ≤ S.B AND S.B ≤ R.A + c2
• Interval join in temporal and spatial data: not general
• Spatial indexing:
  - Large memory footprint
  - Expensive preprocessing

Page 43: Scaling Big Data Cleansing

IEJoin – a New Join Algorithm

• In data cleansing:
  - Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary > t2.Salary AND t1.Tax < t2.Tax
• Interval intersection:
  - Q2: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≥ s.start
• Joining tables with (≠):
  - Qk: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≠ s.start

Page 44: Scaling Big Data Cleansing

Algorithm Discovery

Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Rate > t2.Rate

     Salary  Rate
t1   100     5
t2   90      9
t3   150     15
t4   120     10

Sort descending on Salary: t3(150) t4(120) t1(100) t2(90)
Salary partial answer: (t2, t1), (t2, t4), (t2, t3), ..., (t4, t3)

Sort descending on Rate: t3(15) t4(10) t2(9) t1(5)
Rate partial answer: (t1, t2), (t1, t4), (t1, t3), ..., (t4, t3)

Page 45: Scaling Big Data Cleansing

Algorithm Discovery

Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Rate > t2.Rate

     Salary  Rate
t1   100     5
t2   90      9
t3   150     15
t4   120     10

Rate partial answer: {(t1, t2), (t1, t4), (t1, t3), (t2, t4), (t2, t3), (t4, t3)}

Salary partial answer: {(t2, t1), (t2, t4), (t2, t3), (t1, t4), (t1, t3), (t4, t3)}

The expected result is: (t2, t1)

Page 46: Scaling Big Data Cleansing

IEJoin – the Algorithm

     Salary  Rate
t1   100     5
t2   90      9
t3   150     15
t4   120     10

• Sort descending on Salary: t3(150) t4(120) t1(100) t2(90), at positions 0 1 2 3
• Sort descending on Rate: t3(15) t4(10) t2(9) t1(5), with permutation array 0 1 3 2

[Figure: the Rate-sorted array is read with a sequential scan (t3, t4, t2, t1); the permutation array gives random access into a bit-array that starts as 0 0 0 0 and ends as 1 1 1 1.]

Result = (t2, t1)
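The walk through the two sorted arrays can be sketched for the self-join Q1 (t1.Salary < t2.Salary AND t1.Rate > t2.Rate). This is a minimal single-machine illustration under stated assumptions, not the published implementation: the function name `ie_self_join` is hypothetical, the bit-array is scanned linearly (no bitmap index), and ties are handled by processing equal rates as a group and re-checking Salary explicitly.

```python
def ie_self_join(rows):
    """rows maps a tuple id to (salary, rate). Returns the pairs (a, b)
    with a.salary < b.salary and a.rate > b.rate."""
    ids = list(rows)
    # L1: ids sorted descending on Salary; pos1 gives each id's position.
    l1 = sorted(ids, key=lambda t: rows[t][0], reverse=True)
    pos1 = {t: i for i, t in enumerate(l1)}
    # L2: ids sorted descending on Rate.
    l2 = sorted(ids, key=lambda t: rows[t][1], reverse=True)
    # Permutation array: position in L1 of the i-th tuple of L2.
    perm = [pos1[t] for t in l2]
    bits = [0] * len(ids)  # bit-array indexed by L1 position

    result = []
    i = 0
    while i < len(l2):
        # Tuples with the same Rate form one group so that equal rates
        # never match each other (the condition is strict).
        j = i
        while j < len(l2) and rows[l2[j]][1] == rows[l2[i]][1]:
            j += 1
        for k in range(i, j):
            b = l2[k]
            # Set bits mark tuples with a strictly higher Rate. Positions
            # after perm[k] in L1 hold tuples with a lower or equal Salary.
            for p in range(perm[k] + 1, len(bits)):
                if bits[p] and rows[l1[p]][0] < rows[b][0]:
                    result.append((l1[p], b))
        for k in range(i, j):
            bits[perm[k]] = 1
        i = j
    return result


SLIDE_EXAMPLE = {"t1": (100, 5), "t2": (90, 9), "t3": (150, 15), "t4": (120, 10)}
EMPLOYEES = {
    "t1": (24000, 15), "t2": (25000, 10), "t3": (40000, 25),
    "t4": (88000, 28), "t5": (15000, 15), "t6": (81000, 28),
}
```

On the four-tuple example the permutation array comes out as 0 1 3 2 and the only result is (t2, t1), matching the slide. On the employee table the same function returns the two Rule 2 violations.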

Page 47: Scaling Big Data Cleansing

Sorting Orders

Q1: SELECT * FROM D t1 JOIN D t2 ON t1.Salary < t2.Salary AND t1.Rate > t2.Rate

OP1 is the operator of the Salary condition; OP2 is the operator of the Rate condition.

• For self joins:
  - Salary: ascending order if OP1 is either > or ≥, otherwise descending order
  - Rate: descending order if OP2 is either > or ≥, otherwise ascending order
• For non-self joins:
  - Salary: descending order if OP1 is either > or ≥, otherwise ascending order
  - Rate: ascending order if OP2 is either > or ≥, otherwise descending order

Page 48: Scaling Big Data Cleansing

Optimizations – Bitmap Index

Bit-array B:   0 0 0 0 | 0 1 0 1 | 0 0 0 0 | 0 0 0 0
Chunks:        C1        C2        C3        C4
Bitmap index:  0         1         0         0

[Figure annotations: (i) pos 6, (ii) pos 9, max]
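A chunked summary like the one above lets a scan skip empty regions of the bit-array. The sketch below assumes chunks of four bits, as suggested by the 16-bit array and the four chunks on the slide; the function names `build_summary` and `scan_set_bits` are hypothetical.

```python
def build_summary(bits, chunk):
    """One summary bit per chunk: 1 if any bit in the chunk is set."""
    return [int(any(bits[i:i + chunk])) for i in range(0, len(bits), chunk)]


def scan_set_bits(bits, summary, chunk, start):
    """Return the positions >= start whose bit is set, visiting only the
    chunks whose summary bit is 1."""
    found = []
    for c in range(start // chunk, len(summary)):
        if not summary[c]:
            continue  # the whole chunk is empty, skip it
        lo = max(start, c * chunk)
        hi = min(len(bits), (c + 1) * chunk)
        found.extend(p for p in range(lo, hi) if bits[p])
    return found


# Bit-array from the slide (positions are 0-based in this sketch).
BITS = [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
SUMMARY = build_summary(BITS, 4)
```

With this layout a scan that starts at position 8 touches no individual bits at all, because the summary bits of C3 and C4 are both 0.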

Page 49: Scaling Big Data Cleansing

Optimizations – Not Equal Operator

• Convert each (≠) into one (>) and one (<) joined with the UNION ALL operator

Qk: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≠ s.start

Q'k: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end < s.start
     UNION ALL
     SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end > s.start
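The rewrite is safe because (<) and (>) are mutually exclusive, so UNION ALL cannot produce duplicates. A brute-force check on made-up data illustrates this; the `EVENTS` values and the helper `join` are hypothetical and only serve this example.

```python
import operator

# Hypothetical events as (start, end) pairs.
EVENTS = {"e1": (1, 4), "e2": (4, 6), "e3": (2, 3), "e4": (5, 9)}


def join(events, second_op):
    """Evaluate r.start <= s.end AND second_op(r.end, s.start) by brute force."""
    return [(r, s)
            for r, (r_start, r_end) in events.items()
            for s, (s_start, s_end) in events.items()
            if r_start <= s_end and second_op(r_end, s_start)]


qk = join(EVENTS, operator.ne)                                      # original Qk
rewritten = join(EVENTS, operator.lt) + join(EVENTS, operator.gt)   # Q'k with UNION ALL
```

Both forms return the same set of pairs, and the rewritten form consists of two joins that use only the operators IEJoin handles directly.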

Page 50: Scaling Big Data Cleansing

Optimizations – Selectivity Estimation

• A query with three attributes: r.Salary < s.Salary AND r.Rate > s.Rate AND r.Age > s.Age
• Use sampling to estimate the maximum output size for each pair of attributes: Est(Salary, Rate), Est(Salary, Age), Est(Rate, Age)

[Figure: estimation pipeline with four stages: range partitioning, sorting, pruning, calculate max output. The data is range-partitioned (Partition 1 to Partition n) based on OP1 and based on OP2.]

Estimated output = number of overlapping partitions = 2

Page 51: Scaling Big Data Cleansing

IEJoin and BigDansing

Page 52: Scaling Big Data Cleansing

Serial IEJoin vs. Naïve Baseline

[Figure: two log-scale charts of runtime (seconds, 0.01 to 10000) against input size (10K, 50K, 100K), one for the Salary-Rate query and one for the interval intersection query. Series: PG-IEJoin, PG-Original, MonetDB, DBMS-X.]

Page 53: Scaling Big Data Cleansing

Serial IEJoin vs. Postgres with Index – 50M Rows

[Figure: bar chart of runtime (seconds, 0 to 10000), split into indexing and querying, for PG-IEJoin, PG-GiST and PG-BTree on queries Q1 and Q2. Some bars are marked X.]

GiST: Generalized Search Tree

Page 54: Scaling Big Data Cleansing

Parallel and Distributed IEJoin – 100M Rows

[Figure: two bar charts, Salary-Rate and interval intersection, of runtime (seconds, 0 to 20000) split into indexing and querying. Series: Parallel-IEJoin, Distributed-IEJoin, DPG-GiST, DPG-BTree, SparkSQL-SM, SparkSQL. Several bars are marked X.]

Page 55: Scaling Big Data Cleansing

IEJoin

• A new join algorithm
• Based on conditions: (<, ≤, >, ≥, ≠)
• Extremely fast and highly scalable
• Utilizes sorting and efficient data structures
• Easy to implement in traditional DBMS and distributed systems

* Zuhair Khayyat, et al., "Fast and Scalable Inequality Joins", The VLDB Journal 2017, Special Issue: Best Papers of VLDB 2015
* Zuhair Khayyat, et al., "Lightning Fast and Space Efficient Inequality Joins", in PVLDB 2015

Page 56: Scaling Big Data Cleansing

Mizan

A System for Dynamic Load Balancing in Large-scale Graph Processing

Page 57: Scaling Big Data Cleansing

BigDansing’s implementations

[Figure: the data cleansing cycle (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update input data) next to the BigDansing stack, with labels: Apache Hadoop, Giraph, Apache Spark, GraphX, HDFS]

Page 58: Scaling Big Data Cleansing

Pregel*/Giraph Abstraction

• Based on vertex-centric computation
• Abstraction:
  - compute(), combine() & aggregate()
• Synchronous, in-memory bulk synchronous parallel (BSP)

[Figure: Workers 1 to 3 run Superstep 1, Superstep 2 and Superstep 3, separated by BSP barriers]

* G. Malewicz, et al., "Pregel: A System for Large-Scale Graph Processing," in SIGMOD 2010
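The vertex-centric BSP model can be sketched as a single-process simulation that propagates the maximum value through a graph. This is an illustration only, not the Pregel or Giraph API: the function name `run_max_value`, the in-memory mailboxes and the tiny example graph are assumptions made for this example.

```python
def run_max_value(graph, values):
    """graph maps a vertex to its neighbours; values maps a vertex to an int.
    Every vertex ends up holding the largest value reachable from it."""
    values = dict(values)
    inbox = {v: [] for v in graph}
    active = set(graph)  # every vertex is active in superstep 0
    superstep = 0
    while active:
        outbox = {v: [] for v in graph}
        for v in active:
            # compute(): fold incoming messages into the vertex value.
            changed = superstep == 0
            for message in inbox[v]:
                if message > values[v]:
                    values[v] = message
                    changed = True
            # Send the value only when it changed; otherwise the vertex
            # stays silent, which acts as a vote to halt.
            if changed:
                for neighbour in graph[v]:
                    outbox[neighbour].append(values[v])
        # BSP barrier: messages become visible in the next superstep, and
        # only vertices that received a message are reactivated.
        inbox = outbox
        active = {v for v in graph if inbox[v]}
        superstep += 1
    return values


GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
RESULT = run_max_value(GRAPH, {"a": 3, "b": 6, "c": 2})
```

Because every superstep ends at the barrier, one slow vertex or worker delays all the others, which is the imbalance that the following slides address.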

Page 59: Scaling Big Data Cleansing

Problems of Giraph

The Light Side
- Algorithm: Predictable
- Structure: Fixed

The Dark Side
- Algorithm: Unforeseen
- Structure: Variable

Error graph (violation graph) is random, big and unpredictable

Page 60: Scaling Big Data Cleansing

How Giraph Optimizes Computations

1. Faster graph loading
   - Simple graph partitioning
   - Hash, Range
2. Optimized for graph structure
   - Sophisticated and expensive partitioning techniques
   - Min-cuts

[Figure: bar chart of run time (min, 0 to 350) for the LiveJournal, kgraph4m68m and arabic-2005 graphs under Hash, Range and Min-cuts partitioning]

The runtime of a single iteration is as fast as the slowest worker

Page 61: Scaling Big Data Cleansing

Behaviors of Different Graph Algorithms

[Figure: PageRank vs. Distributed Minimal Spanning Tree (DMST). Incoming messages (millions, log scale from 0.001 to 1000) over supersteps 0 to 60, with four series: PageRank - Total, PageRank - Max/W, DMST - Total, DMST - Max/W.]

Page 62: Scaling Big Data Cleansing

Source of Imbalance in Giraph

1. High vertex response time
2. Long time to receive incoming messages
3. Long time to send outgoing messages

[Figure: three timelines of Superstep 1 across Workers 1 to 3 up to the BSP barrier, one per source of imbalance: high vertex response time, long time to receive incoming messages, long time to send outgoing messages]

Page 63: Scaling Big Data Cleansing

Mizan – Solving the Workload Imbalance

• Move vertices between workers during runtime
• Planning and vertex migrations within the BSP barrier to maintain computation consistency

[Figure: Supersteps 1 to 3 across Workers 1 to 3, with a migration planner and a migration barrier added at each BSP barrier]

[Figure: Mizan worker architecture with components: Vertex Compute(), BSP Graph Processor, Load Balancer: Migration Planner, Communicator - DHT, Storage Manager, IO, HDFS/Local Disks]

Page 64: Scaling Big Data Cleansing

Mizan’s Migration Planning Steps

1. Identify the source of workload imbalance across workers
   - Remote outgoing messages
   - All incoming messages
   - Response time

[Figure: two Mizan workers holding vertices V1 to V6, annotated with remote incoming messages, remote outgoing messages, local incoming messages and vertex response time]

Page 65: Scaling Big Data Cleansing

Mizan’s Migration Planning Steps

1. Identify the source of workload imbalance across workers
2. Select the migration objective through a statistical analysis
   - Optimize for outgoing messages, or
   - Optimize for incoming messages, or
   - Optimize for response time

Page 66: Scaling Big Data Cleansing

Mizan’s Migration Planning Steps

1. Identify the source of workload imbalance across workers
2. Select the migration objective through a statistical analysis
3. Pair over-utilized workers with under-utilized ones

[Figure: workers sorted into positions 0 to 8 in the order W7, W2, W1, W5, W8, W4, W0, W6, W3, with W9 shown separately]

Page 67: Scaling Big Data Cleansing

Mizan’s Migration Planning Steps

1. Identify the source of workload imbalance across workers
2. Select the migration objective through a statistical analysis
3. Pair over-utilized workers with under-utilized ones
4. Select vertices to migrate
   - Select the least number of vertices that has the highest impact
   - Vertex ownership: distributed hash table (DHT)
   - Delayed migration: reduce migration cost
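Step 3 can be sketched as sorting the workers by the chosen objective and pairing the two ends of the sorted list. This is one plausible reading of the slide rather than Mizan's actual planner: the function name `pair_workers` and the load values are hypothetical, and any threshold for deciding whether a worker counts as over-utilized is left out.

```python
def pair_workers(load):
    """load maps a worker name to its value for the migration objective
    (for example outgoing messages). Returns (over, under) pairs."""
    ranked = sorted(load, key=load.get, reverse=True)  # most loaded first
    pairs = []
    i, j = 0, len(ranked) - 1
    while i < j:
        # Pair the i-th most loaded worker with the i-th least loaded one.
        pairs.append((ranked[i], ranked[j]))
        i += 1
        j -= 1
    return pairs


# Hypothetical per-worker loads for one superstep.
LOAD = {"W0": 50, "W1": 90, "W2": 10, "W3": 70, "W4": 30}
PAIRS = pair_workers(LOAD)
```

With an odd number of workers the median worker stays unpaired, so it neither sends nor receives vertices in that round.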

Page 68: Scaling Big Data Cleansing

Performance of Mizan on PageRank

[Figure: bar chart of runtime (min, 0 to 40) for Static, WS and Mizan under Metis, Range and Hash partitioning]

Page 69: Scaling Big Data Cleansing

Performance of Mizan with Metis

[Figure: bar chart of runtime (min, 0 to 300) for the Advertisement and DMST algorithms, comparing Static, Work Stealing and Mizan]

Page 70: Scaling Big Data Cleansing

Mizan – a General Graph Processing System

• A Pregel-clone
  - Supports very large graphs
  - Runs on very large clusters
• Dynamic fine-grained vertex migrations to balance computation and communication
• Optimized for predictable and non-predictable graph algorithms and structures

[Figure: BigDansing stack with labels: Apache Spark, Mizan, GraphX, Giraph, HDFS]

* Zuhair Khayyat, et al., "Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing", in EuroSys 2013

Page 71: Scaling Big Data Cleansing

Summary

• A general system for big data cleansing
  - Performance up to 2 orders of magnitude faster
  - SIGMOD 2015
• A novel algorithm for fast inequality joins
  - Performance at least 2 orders of magnitude faster
  - PVLDB 2015 & VLDBJ 2017
• A general system for distributed graph processing
  - Performance improvements up to 84%
  - EuroSys 2013

Page 72: Scaling Big Data Cleansing

Publications

• Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, "Fast and Scalable Inequality Joins", The VLDB Journal 2017, Special Issue: Best Papers of VLDB 2015.

• Divy Agrawal, Lamine Ba, Laure Berti-Equille, Sanjay Chawla, Ahmed Elmagarmid, Hossam Hammady, Yasser Idris, Zoi Kaoudi, Zuhair Khayyat, Sebastian Kruse, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Mohammed J. Zaki, "Rheem: Enabling Multi-Platform Task Execution", in SIGMOD 2016.

• Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, "Lightning Fast and Space Efficient Inequality Joins", in PVLDB 2015.

• Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin, "BigDansing: A System for Big Data Cleansing", in SIGMOD 2015.

• Zuhair Khayyat, Karim Awara, Amani Alonazi, Hani Jamjoom, Dan Williams, Panos Kalnis, "Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing", in EuroSys 2013.