Scaling Big Data Cleansing

PhD Defense of: Zuhair Khayyat

May 2017

What is Data Cleansing?

• Data cleansing is the process of:
  A. detecting errors in record sets, tables, or databases (violation detection)
  B. and fixing them (violation repair)
• Example errors in data:
  • Typos • Duplicates • Values inconsistent with business rules
  • Outliers • Outdated values • Missing values

May16,2017 2/73

Why is Data Cleansing Important?

• 25% of the world's critical data is dirty
• 60%-98% of a data scientist's time is lost in the process of data cleansing
• “duplicate and dirty data costs the healthcare industry over $300 billion every year” -- Joe Fusaro (RingLead)
• “inaccurate data has a direct impact ... the average company losing 12% of its revenue” -- Ben Davis (Econsultancy)

Example of a Dirty Dataset

A company employee database:

• Rule 1: Any two employees in the same Zipcode must be in the same City
• Rule 2: An employee who earns a higher salary must pay more taxes compared to others

     Name    Zipcode  City  State  Salary  Rate
t1   Annie   10001    NY    NY     24000   15
t2   Laure   90210    LA    CA     25000   10
t3   John    60601    CH    IL     40000   25
t4   Mark    90210    SF    CA     88000   28
t5   Robert  60827    CH    IL     15000   15
t6   Mary    90210    LA    CA     81000   28

The Process of Data Cleansing

[Figure: the data cleansing process over the employee table. Inputs: Input Data and Detection Rules. Steps: 1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update Input Data. States: Dirty, Clean Data.]

Why is Dirty Data Still a Problem?

• Data is growing at a 40% compound annual rate
• Source: Oracle, 2012, https://goo.gl/uHd4uR

[Figure: data growth chart, annotated "≈15 Zettabytes".]

Problems of Big Data Cleansing

[Figure: the four-step cleansing process (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update Input Data), annotated "≈90% Runtime" and "Most of Research".]

[Figure: bar chart of Time (Seconds), 0 to 100, against Violation percentage (1%, 5%, 10%, 50%); series: Violation detection, Data repair.]



1. Violation detection becomes too expensive with big data:
  a. Enumerating all tuples is not possible
  b. Not feasible to implement a parallel version of each detection rule
  c. Serial repair algorithms cannot handle big errors



2. Complex error discovery rules based on inequality conditions are too expensive:

Rule 2: An employee who earns a higher salary must pay more taxes compared to others

⇒ $(t_i.\text{salary} < t_j.\text{salary})\ \text{AND}\ (t_i.\text{tax} > t_j.\text{tax})$



3. Error graph (violation graph) is random, big and unpredictable:
  • Irregular structures
  • Skewed distributions
  • Unpredictable workload of algorithm

Problems & Solutions of Big Data Cleansing

Problems
1. Violation detection becomes too expensive with big data
2. Complex error discovery rules based on inequality conditions are too expensive
3. Error graph (violation graph) is random, big and unpredictable

Solutions
• Develop a general-purpose scalable data cleansing system: BigDansing
• Introduce a new join algorithm to enhance inequality joins: IEJoin
• Build a general graph system that adapts to various graph structures and algorithms: Mizan

BigDansing: A System for Big Data Cleansing

Related Work?

NADEEF*

[Figure: NADEEF architecture; labels: DBMS, UDF, Declarative Rules.]

✓ Easy-to-use
✓ Extensible
✓ Efficient
✗ Scalable (Single Machine)

* M. Dallachiesa, et al., “NADEEF: A Commodity Data Cleaning System,” in SIGMOD 2013

What does Big Data Cleansing require?

1. Scale Detection
  • Declarative rules
    • Functional dependencies (FDs, CFDs)
    • Denial constraints (DCs)
  • User defined functions
2. Scale Repairs
  • Handle serial repair algorithms

BigDansing – Scaling Violation Detection

[Figure: rule types (Functional dependencies, Denial constraints, Entity resolution, Inclusion dependencies) expressed in a Domain Specific Language with five operators: Scope, Block, Iterate, Detect, GenFix.]

BigDansing – Input

[Figure: two inputs, UDFs (Scope, Block, Iterate, Detect, GenFix) and Declarative Rules (through the Rule Parser), lead to the Violation Detection Plan (Logical Plan).]

BigDansing – Plan Conversion and Optimization

Logical Plan → Physical Plan → Execution Plan

Rule 1 – Logical Plan

• Any two employees in the same Zipcode must be in the same City
• FD: Zipcode → City

Logical Operators: Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix

Rule 1 – Physical Plan

• Any two employees in the same Zipcode must be in the same City

Physical Operators: PScope, PBlock, PIterate, PDetect, PGenFix, CoBlock, UCrossProduct

Physical plan: PScope → PBlock → PIterate → PDetect → PGenFix

Rule 1 – Execution Plan

• Any two employees in the same Zipcode must be in the same City

Physical plan: PScope → PBlock → PIterate → PDetect → PGenFix

Execution plan: Spark-PScope → Spark-PBlock → Spark-PIterate → Spark-PDetect → Spark-PGenFix

Rule 1 – Execution Example

Scope(Zipcode, City) → Block(Zipcode) → Iterate → Detect(City_i ≠ City_j) → GenFix

Input:
     Name    Zipcode  City  State  Salary  Rate
t1   Annie   10001    NY    NY     24000   15
t2   Laure   90210    LA    CA     25000   10
t3   John    60601    CH    IL     40000   25
t4   Mark    90210    SF    CA     88000   28
t5   Robert  60827    CH    IL     15000   15
t6   Mary    90210    LA    CA     81000   28

1) Scope:
     Zipcode  City
t1   10001    NY
t2   90210    LA
t3   60601    CH
t4   90210    SF
t5   60827    CH
t6   90210    LA

2) Block:
     Zipcode  City
t1   10001    NY
t2   90210    LA
t4   90210    SF
t6   90210    LA
t3   60601    CH
t5   60827    CH

3) Iterate: (t2,t4), (t2,t6), (t4,t6)
4) Detect: (t2,t4), (t4,t6)
5) GenFix: t2[City] = t4[City], t4[City] = t6[City]
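The five logical operators for Rule 1 can be sketched in a few lines of plain Python. This is only an illustration of the Scope, Block, Iterate, Detect, GenFix flow on the six-row example; the function names and the dictionary layout are assumptions made for this sketch and are not BigDansing's actual API.

```python
from collections import defaultdict
from itertools import combinations

# Employee table from the running example:
# id -> (Name, Zipcode, City, State, Salary, Rate)
EMPLOYEES = {
    "t1": ("Annie", "10001", "NY", "NY", 24000, 15),
    "t2": ("Laure", "90210", "LA", "CA", 25000, 10),
    "t3": ("John", "60601", "CH", "IL", 40000, 25),
    "t4": ("Mark", "90210", "SF", "CA", 88000, 28),
    "t5": ("Robert", "60827", "CH", "IL", 15000, 15),
    "t6": ("Mary", "90210", "LA", "CA", 81000, 28),
}

def scope(table):
    """Keep only the attributes the rule needs: (Zipcode, City)."""
    return {tid: (row[1], row[2]) for tid, row in table.items()}

def block(scoped):
    """Group tuple ids by the blocking key (Zipcode)."""
    blocks = defaultdict(list)
    for tid, (zipcode, _city) in scoped.items():
        blocks[zipcode].append(tid)
    return blocks

def iterate(blocks):
    """Enumerate candidate pairs inside each block only."""
    for tids in blocks.values():
        yield from combinations(sorted(tids), 2)

def detect(scoped, pairs):
    """A pair violates Zipcode -> City when the two cities differ."""
    return [(a, b) for a, b in pairs if scoped[a][1] != scoped[b][1]]

def gen_fix(violations):
    """Emit one possible fix per violation as an equality to enforce."""
    return [f"{a}[City] = {b}[City]" for a, b in violations]

scoped = scope(EMPLOYEES)
candidates = list(iterate(block(scoped)))
violations = detect(scoped, candidates)
fixes = gen_fix(violations)
```

On the example table this yields the three candidate pairs inside the 90210 block, the two violating pairs (t2,t4) and (t4,t6), and the corresponding two fixes, matching the slide.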

Rule 2 – Logical Plan

• An employee who earns a higher salary must pay more taxes compared to others
• DC: $\forall t_i, t_j \in D,\ \neg(t_i.\text{Salary} < t_j.\text{Salary} \land t_i.\text{Rate} > t_j.\text{Rate})$

     Name    Zipcode  City  State  Salary  Rate
t1   Annie   10001    NY    NY     24000   15
t2   Laure   90210    LA    CA     25000   10
t3   John    60601    CH    IL     40000   25
t4   Mark    90210    SF    CA     88000   28
t5   Robert  60827    CH    IL     15000   15
t6   Mary    90210    LA    CA     81000   28

• For Annie, compare Salary with:
  • Laure
  • John
  • Mark
  • Robert
  • Mary

[Figure annotations next to the names: Compare Rate (four times), Report a Violation!]

Rule 2 – Logical Plan

• An employee who earns a higher salary must pay more taxes compared to others
• DC: $\forall t_i, t_j \in D,\ \neg(t_i.\text{Salary} < t_j.\text{Salary} \land t_i.\text{Rate} > t_j.\text{Rate})$

Logical Operators: Scope(Salary, Rate) → Iterate → Detect(t_i.Salary < t_j.Salary ∧ t_i.Rate > t_j.Rate) → GenFix

Rule 2 – Physical Plan

• An employee who earns a higher salary must pay more taxes compared to others
• DC: $\forall t_i, t_j \in D,\ \neg(t_i.\text{Salary} < t_j.\text{Salary} \land t_i.\text{Rate} > t_j.\text{Rate})$

Physical Operators: PScope, PBlock, PIterate, PDetect, PGenFix, CoBlock, UCrossProduct

Physical plan: PScope → UCrossProduct → PDetect → PGenFix

Rule 2 – Execution Plan

• An employee who earns a higher salary must pay more taxes compared to others
• DC: $\forall t_i, t_j \in D,\ \neg(t_i.\text{Salary} < t_j.\text{Salary} \land t_i.\text{Rate} > t_j.\text{Rate})$

Physical plan: PScope → UCrossProduct → PDetect → PGenFix

Execution plan: Spark-PScope → Spark-UCrossProduct → Spark-PDetect → Spark-PGenFix

Plan Optimizations – OCJoin

• DC: $\forall t_i, t_j \in D,\ \neg(t_i.\text{Salary} < t_j.\text{Salary} \land t_i.\text{Rate} > t_j.\text{Rate})$

OCJoin steps: Range Partitioning → Sorting → Pruning → Joining

[Figure: the data is range-partitioned into Partition 1 ... Partition n based on Salary and based on Rate; after pruning, the remaining partitions are joined (Partition 2 ⨝ Partition 3, Partition 5 ⨝ Partition 6).]
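The pruning idea behind the partition-based join can be illustrated with a small sketch. It is a simplification under stated assumptions, not the OCJoin implementation: it range-partitions on Salary only, keeps min/max bounds per partition, skips partition pairs whose bounds cannot satisfy both inequalities, and omits the local sorting step. The function names and the partition size are hypothetical.

```python
# Rows of the running example as (id, Salary, Rate).
ROWS = [
    ("t1", 24000, 15), ("t2", 25000, 10), ("t3", 40000, 25),
    ("t4", 88000, 28), ("t5", 15000, 15), ("t6", 81000, 28),
]

def partition_by_salary(rows, size):
    """Range-partition on Salary: sort, then cut into chunks of `size` rows."""
    ordered = sorted(rows, key=lambda r: r[1])
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def bounds(part):
    """Min/max of Salary and Rate inside one partition."""
    salaries = [r[1] for r in part]
    rates = [r[2] for r in part]
    return min(salaries), max(salaries), min(rates), max(rates)

def may_join(p, q):
    """Could some a in p, b in q satisfy a.Salary < b.Salary and a.Rate > b.Rate?"""
    p_min_sal, _, _, p_max_rate = bounds(p)
    _, q_max_sal, q_min_rate, _ = bounds(q)
    return p_min_sal < q_max_sal and p_max_rate > q_min_rate

def partitioned_inequality_join(rows, size=2):
    """Join only the partition pairs that survive the bound check."""
    parts = partition_by_salary(rows, size)
    out = []
    for p in parts:
        for q in parts:
            if not may_join(p, q):  # pruning: this pair cannot produce a match
                continue
            out += [(a[0], b[0]) for a in p for b in q
                    if a[1] < b[1] and a[2] > b[2]]
    return sorted(out)

violations = partitioned_inequality_join(ROWS)
```

With three partitions of two rows each, seven of the nine partition pairs are pruned by the bound check, and the two Rule 2 violations are found in the remaining pairs.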

Rule 2 – Execution Plan

• Rule 2: An employee who earns a higher salary must pay more taxes compared to others
• DC: $\forall t_i, t_j \in D,\ \neg(t_i.\text{Salary} < t_j.\text{Salary} \land t_i.\text{Rate} > t_j.\text{Rate})$

Physical plan: PScope → OCJoin(t_i.Salary < t_j.Salary ∧ t_i.Rate > t_j.Rate) → PDetect → PGenFix

Execution plan: Spark-PScope → Spark-OCJoin → Spark-PDetect → Spark-PGenFix

What does Big Data Cleansing require?

1. Scale Detection
  • Declarative rules
    • Functional dependencies (FDs, CFDs)
    • Denial constraints (DCs)
  • User defined functions
2. Scale Repairs
  • Handle serial repair algorithms

BigDansing – Structure of the Violation Graph

Rule 1: Zipcode → City
Rule 2: $\forall t_1, t_2 \in D,\ \neg(t_1.\text{Salary} < t_2.\text{Salary} \land t_1.\text{Rate} > t_2.\text{Rate})$

     Name    Zipcode  City  State  Salary  Rate
t1   Annie   10001    NY    NY     24000   15
t2   Laure   90210    LA    CA     25000   10
t3   John    60601    CH    IL     40000   25
t4   Mark    90210    SF    CA     88000   28
t5   Robert  60827    CH    IL     15000   15
t6   Mary    90210    LA    CA     81000   28

• Rule 1: t2[City] = t4[City]
• Rule 2: t1[Salary] > t2[Salary] OR t1[Tax] < t2[Tax]

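BigDansing – Structure of the Violation Graph

Rule 1: Zipcode → City
Rule 2: $\forall t_1, t_2 \in D,\ \neg(t_1.\text{Salary} < t_2.\text{Salary} \land t_1.\text{Rate} > t_2.\text{Rate})$

[Figure: violation graph over the nodes t1, t2, t4, t5 and t6, with edges labelled R1: City (twice) and R2: Salary, Tax.]

• Rule 1: t2[City] = t4[City]
• Rule 1: t4[City] = t6[City]
• Rule 2: t1[Salary] > t2[Salary] OR t1[Tax] < t2[Tax]
• Rule 2: t5[Salary] > t2[Salary] OR t5[Tax] < t2[Tax]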

BigDansing – Data Repair as a Black Box

[Figure: Graph Analysis splits the violation graph (nodes t1, t5, t2, t4, t6; edges R1: City, R1: City, R2: Salary, Tax) into sub-graphs: {t1, t5, t2} under R2: Salary, Tax; {t2, t4, t6} under R1: City; and further pairs (tx, ty) under R1: City. Each sub-graph is passed to its own instance of a Serial Repair Algorithm.]
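The graph-analysis step can be sketched as grouping violations that share data. The sketch below assumes violations are modelled as hyperedges over cells (tuple.attribute), which is one way to read the figure, where t2 appears in both sub-graphs; the union-find helper, the cell naming, and the stand-in `serial_repair` are all hypothetical and only illustrate how independent groups could be handed to separate serial repair instances.

```python
def split_violations(hyperedges):
    """Group hyperedges (tuples of cells) that share at least one cell."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for cells in hyperedges:
        first, *rest = cells
        for c in rest:
            parent[find(c)] = find(first)  # merge the two groups

    groups = {}
    for cells in hyperedges:
        groups.setdefault(find(cells[0]), []).append(cells)
    return list(groups.values())

def serial_repair(group):
    """Stand-in for a black-box serial repair: just lists the cells it would see."""
    return sorted({cell for cells in group for cell in cells})

# Violations of the running example, written as cells touched by each violation.
VIOLATIONS = [
    ("t2.City", "t4.City"),                               # Rule 1
    ("t4.City", "t6.City"),                               # Rule 1
    ("t1.Salary", "t2.Salary", "t1.Rate", "t2.Rate"),     # Rule 2
    ("t5.Salary", "t2.Salary", "t5.Rate", "t2.Rate"),     # Rule 2
]

groups = split_violations(VIOLATIONS)
repair_inputs = [serial_repair(g) for g in groups]  # each call could run on its own worker
```

On the example this produces two independent groups, one holding the City cells of t2, t4 and t6 and one holding the Salary and Rate cells of t1, t2 and t5.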

BigDansing – Apache Spark Stack

Performance of a Single Machine

[Figure: two bar charts, Rule 1 and Rule 2, of Runtime (Seconds) against Dataset size (rows); one panel covers 100,000, 1,000,000 and 10,000,000 rows (axis 0 to 6000), the other 100,000, 200,000 and 300,000 rows (axis 0 to 16000); series: BigDansing, NADEEF, PostgreSQL, Spark SQL, Shark.]

Performance on a 16-machine cluster

[Figure: two bar charts, Rule 1 and Rule 2, of Time (Seconds) against Dataset size (rows); one panel covers 1M, 2M and 3M rows (axis 0 to 120000) with series BigDansing-Spark, Spark SQL and Shark; the other covers 10M, 20M and 40M rows (axis 0 to 20000) with series BigDansing-Spark, BigDansing-Hadoop, Spark SQL and Shark.]

Performance on a 16-machine cluster

[Figure: two charts; one plots Runtime (Seconds), 0 to 125000, against #-workers (1, 2, 4, 8, 16) for BigDansing and Spark SQL; the other plots Time (Seconds), 0 to 200000, against Dataset size (rows) (647M, 959M, 1271M, 1583M, 1907M) for BigDansing-Spark, BigDansing-Hadoop and Spark SQL.]

Detecting Violations on RDF

[Figure: logical plan with Scope feeding three branches (Block1 → Iterate1, Block2 → Iterate2, Block3 → Iterate3), followed by Detect → GenFix.]

Detecting Violations on RDF

[Figure: stacked bar chart of Runtime (Seconds), 0 to 5000, against Number of RDF triples (21M, 42M, 85M, 170M), comparing BigDansing and S2RDF*; each bar is split into Pre-processing and Violation Detection.]

* Alexander Schätzle, et al., “S2RDF: RDF Querying with SPARQL on Spark”, in PVLDB 2016

BigDansing: A System for Big Data Cleansing

✓ Easy-to-use
✓ Efficient
✓ Extensible
✓ Scalable

* Zuhair Khayyat, et al., “BigDansing: A System for Big Data Cleansing”, in SIGMOD 2015.

IEJoin: Fast and Scalable Inequality Joins

OCJoin in BigDansing

[Figure: two bar charts; one plots Runtime (Seconds), 0 to 100000, against Data size (rows) (100,000, 200,000, 300,000) for OCJoin, UCrossProduct and Cross product; the other plots Time (Seconds), 0 to 120000, against Dataset size (rows) (1M, 2M, 3M) for BigDansing-Spark, Spark SQL and Shark, with bar labels 1240, 5319 and 7730.]

What is the Problem?

• Rule 2: $\forall t_1, t_2 \in D,\ \neg(t_1.\text{Salary} < t_2.\text{Salary} \land t_1.\text{Rate} > t_2.\text{Rate})$
  • Select * from D t1 JOIN D t2 on t1.Salary < t2.Salary AND t1.Tax > t2.Tax
• Processed as a Cartesian product: $O(n^2)$
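To make the quadratic cost concrete, here is a minimal sketch of the naive evaluation: every ordered pair of tuples is compared, so n tuples cost n(n-1) comparisons. The row layout and function name are assumptions for illustration, and the Rate column of the example table stands in for the tax attribute of the query.

```python
from itertools import permutations

# Running example reduced to id -> (Salary, Rate).
ROWS = {
    "t1": (24000, 15), "t2": (25000, 10), "t3": (40000, 25),
    "t4": (88000, 28), "t5": (15000, 15), "t6": (81000, 28),
}

def naive_rule2(rows):
    """Check every ordered pair (a, b) for a.Salary < b.Salary and a.Rate > b.Rate."""
    comparisons = 0
    violations = []
    for a, b in permutations(rows, 2):
        comparisons += 1
        if rows[a][0] < rows[b][0] and rows[a][1] > rows[b][1]:
            violations.append((a, b))
    return violations, comparisons

violations, comparisons = naive_rule2(ROWS)
```

For the six example tuples this performs 30 comparisons to find two violating pairs; at one million tuples the same loop would need roughly 10^12 comparisons.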

Related Work

• Band Join:
  • Based on a point within a range: R.A − c1 ≤ S.B & S.B ≤ R.A + c2
• Interval join in temporal and spatial data: not general
• Spatial indexing:
  • Large memory footprint
  • Expensive preprocessing

IEJoin – a New Join Algorithm

• In data cleansing:
  • Q1: Select * from D t1 JOIN D t2 on t1.Salary > t2.Salary AND t1.Tax < t2.Tax
• Interval intersection:
  • Q2: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≥ s.start
• Joining tables with (≠):
  • Qk: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≠ s.start

Algorithm Discovery

Q1: Select * from D t1 JOIN D t2 on t1.Salary < t2.Salary AND t1.Rate > t2.Rate

     Salary  Rate
t1   100     5
t2   90      9
t3   150     15
t4   120     10

Sort descending on Salary: t3(150) t4(120) t1(100) t2(90)
Salary partial answer: (t2,t1), (t2,t4), (t2,t3), …, (t4,t3)

Sort descending on Rate: t3(15) t4(10) t2(9) t1(5)
Rate partial answer: (t1,t2), (t1,t4), (t1,t3), …, (t4,t3)

Algorithm Discovery

Q1: Select * from D t1 JOIN D t2 on t1.Salary < t2.Salary AND t1.Rate > t2.Rate

     Salary  Rate
t1   100     5
t2   90      9
t3   150     15
t4   120     10

Salary partial answer: {(t2,t1), (t2,t4), (t2,t3), (t1,t4), (t1,t3), (t4,t3)}

Rate partial answer: {(t1,t2), (t1,t4), (t1,t3), (t2,t4), (t2,t3), (t4,t3)}

The expected result is: (t2,t1)

IEJoin – the Algorithm

     Salary  Rate
t1   100     5
t2   90      9
t3   150     15
t4   120     10

• Sort descending on Salary: t3(150) t4(120) t1(100) t2(90), positions 0 1 2 3
• Sort descending on Rate: t3(15) t4(10) t2(9) t1(5), Permutation Array: 0 1 3 2

[Figure: the Bit-Array, shown over t3 t4 t2 t1 with initial values 0 0 0 0 and bits set to 1 as the scan proceeds; annotations: Sequential scan, Random access.]

Result = (t2,t1)
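The combination of two sorted orders, a position mapping and a bit array can be sketched as follows for the self-join Q1 (t1.Salary < t2.Salary AND t1.Rate > t2.Rate). This is a simplified illustration written for this transcript, not the published IEJoin code: the sort directions, the handling of ties and the plain Python list used as a bit array are choices made for the sketch.

```python
from bisect import bisect_left
from itertools import groupby

def ie_self_join(rows):
    """Return pairs (a, b) with a.Salary < b.Salary and a.Rate > b.Rate.

    rows maps a tuple id to (Salary, Rate).
    """
    by_salary = sorted(rows, key=lambda t: rows[t][0])   # first sorted list
    salaries = [rows[t][0] for t in by_salary]
    pos = {t: i for i, t in enumerate(by_salary)}        # tuple -> slot in by_salary
    by_rate = sorted(rows, key=lambda t: -rows[t][1])    # second list, highest Rate first
    bits = [False] * len(rows)                           # bit array over by_salary slots
    out = []
    # Tuples with equal Rate are handled as one group so ties never match each other.
    for _, group in groupby(by_rate, key=lambda t: rows[t][1]):
        group = list(group)
        for b in group:
            # Every set bit belongs to a tuple with a strictly higher Rate;
            # slots below `hi` hold strictly smaller salaries.
            hi = bisect_left(salaries, rows[b][0])
            out.extend((by_salary[i], b) for i in range(hi) if bits[i])
        for b in group:
            bits[pos[b]] = True                          # mark the tuple as visited
    return out

EXAMPLE = {"t1": (100, 5), "t2": (90, 9), "t3": (150, 15), "t4": (120, 10)}
result = ie_self_join(EXAMPLE)
```

On the four-tuple example the scan visits t3, t4, t2 and t1 in Rate order, and only t1 finds a set bit among the slots with smaller Salary, giving the single pair (t2, t1).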

Sorting Orders

Q1: Select * from D t1 JOIN D t2 on t1.Salary < t2.Salary AND t1.Rate > t2.Rate
(OP1 is the Salary operator, OP2 is the Rate operator)

• For self joins:
  • Salary: ascending order if OP1 is either > or ≥, otherwise descending order
  • Rate: descending order if OP1 is either > or ≥, otherwise ascending order
• Non-self joins:
  • Salary: descending order if OP1 is either > or ≥, otherwise descending order
  • Rate: ascending order if OP1 is either > or ≥, otherwise descending order

Optimizations – Bitmap Index

[Figure: a 16-bit Bit-Array B (0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0) divided into four chunks C1 to C4 and summarised by a 4-bit bitmap index (0 1 0 0); markers: (i) pos 6, (ii) pos 9, max.]

Optimizations – Not Equal Operator

• Convert each (≠) into one (>) and one (<) joined with the UNION ALL operator

Qk: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end ≠ s.start

Q'k: SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end < s.start
UNION ALL
SELECT * FROM Events r, Events s WHERE r.start ≤ s.end AND r.end > s.start

Optimizations – Selectivity Estimation

• A query with three attributes: r.Salary < s.Salary AND r.Rate > s.Rate AND r.Age > s.Age
• Use sampling to estimate the maximum output size – Est(Salary, Rate), Est(Salary, Tax), Est(Tax, Age)

Steps: Range Partitioning → Sorting → Pruning → Calculate Max Output

[Figure: Partition 1 ... Partition n based on OP1 and based on OP2; Estimated Output = number of overlapping partitions = 2.]

IEJoin and BigDansing

Serial IEJoin vs. Naïve Baseline

[Figure: two log-scale charts, Salary-Rate and Interval Intersection, of Runtime (Seconds), 0.01 to 10000, against Input size (10K, 50K, 100K); series: PG-IEJoin, PG-Original, MonetDB, DBMS-X.]

Serial IEJoin vs. Postgres with Index – 50M Rows

[Figure: stacked bar chart of Runtime (Seconds), 0 to 10000, for PG-IEJoin, PG-GiST and PG-BTree on queries Q1 and Q2; each bar is split into Indexing and Querying. Labels shown: 146, 3928, 310, 6287 and X; other labels: 16 workers, 1 worker.]

GiST: Generalized Search Tree

Parallel and Distributed IEJoin – 100M Rows

[Figure: two stacked bar charts, Salary-Rate and Interval Intersection, of Runtime (Seconds), 0 to 20000, for Parallel-IEJoin, Distributed-IEJoin, DPG-GiST, DPG-BTree, SparkSQL-SM and SparkSQL; bars are split into Indexing and Querying. Labels shown: 4302, 1313 and X (four times) in one panel; 4965, 1376 and X (three times) in the other.]

IEJoin

• A new join algorithm
• Based on conditions: (<, ≤, >, ≥, ≠)
• Extremely fast and highly scalable
• Utilizes sorting and efficient data structures
• Easy to implement in traditional DBMS and distributed systems

* Zuhair Khayyat, et al., “Fast and Scalable Inequality Joins”, The VLDB Journal 2017, Special Issue: Best Papers of VLDB 2015
* Zuhair Khayyat, et al., “Lightning Fast and Space Efficient Inequality Joins”, in PVLDB 2015

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

BigDansing's Implementations

[Figure: the four-step cleansing process (1st: Detect, 2nd: Analyze, 3rd: Repair, 4th: Update Input Data) alongside BigDansing's software stack: BigDansing on top of Apache Hadoop with Giraph and Apache Spark with GraphX, over HDFS.]

Pregel*/Giraph Abstraction

• Based on vertex-centric computation
• Abstraction:
  • compute(), combine() & aggregate()
• Synchronous, in-memory bulk synchronous parallel (BSP) execution

[Figure: Workers 1 to 3 executing Superstep 1, Superstep 2 and Superstep 3, separated by BSP barriers.]

* G. Malewicz, et al., “Pregel: A System for Large-Scale Graph Processing,” in SIGMOD 2010
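The vertex-centric, superstep-based model can be illustrated with a single-process simulation. This is a sketch under stated assumptions rather than the Pregel or Giraph API: `run_bsp` and the `compute` signature are invented for the example, workers are not modelled, and the sample algorithm (minimum-label propagation for connected components) is chosen only because it is short.

```python
def run_bsp(graph, values, compute, max_supersteps=30):
    """Run supersteps until no vertex sends a message; return the number executed."""
    inbox = {v: [] for v in graph}
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in graph}
        active = False
        for v in graph:  # a real system spreads this loop across workers
            values[v], msgs = compute(superstep, values[v], inbox[v], graph[v])
            for target, msg in msgs:
                outbox[target].append(msg)
                active = True
        inbox = outbox   # barrier: messages become visible in the next superstep
        if not active:
            return superstep + 1
    return max_supersteps

def min_label(superstep, value, messages, neighbours):
    """Adopt the smallest label seen so far; notify neighbours only on change."""
    best = min([value] + messages)
    if superstep == 0 or best < value:
        return best, [(n, best) for n in neighbours]
    return value, []

# Two components: a-b-c and d-e (adjacency lists, undirected).
GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
labels = {v: v for v in GRAPH}
supersteps = run_bsp(GRAPH, labels, min_label)
```

Each vertex ends up labelled with the smallest vertex id in its component, and the run stops at the first superstep in which no messages are sent.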

Problems of Giraph

The Light Side
• Algorithm: Predictable
• Structure: Fixed

The Dark Side
• Algorithm: Unforeseen
• Structure: Variable

Error graph (violation graph) is random, big and unpredictable

How Giraph Optimizes Computations

1. Faster graph loading
  • Simple graph partitioning
  • Hash, Range
2. Optimized for graph structure
  • Sophisticated and expensive partitioning techniques
  • Min-cuts

[Figure: bar chart of Run Time (Min), 0 to 350, for the graphs LiveJournal, kgraph4m68m and arabic-2005 under Hash, Range and Min-cuts partitioning.]

The runtime of a single iteration is as fast as the slowest worker

Behaviors of Different Graph Algorithms

[Figure: "PageRank vs. Distributed Minimal Spanning Tree"; In Messages (Millions), log scale from 0.001 to 1000, against SuperSteps (0 to 60); series: PageRank - Total, PageRank - Max/W, DMST - Total, DMST - Max/W.]

Source of Imbalance in Giraph

1. High vertex response time
2. Long time to receive incoming messages
3. Long time to send outgoing messages

[Figure: three Superstep 1 timelines for Workers 1 to 3 up to the BSP barrier, one per cause: high vertex response time, long time to receive in messages, long time to send out messages.]

Mizan – Solving the Workload Imbalance

• Move vertices between workers during runtime
• Planning and vertex migrations within the BSP barrier to maintain computation consistency

[Figure: Workers 1 to 3 over Supersteps 1 to 3, with a Migration Planner and a Migration Barrier alongside each BSP Barrier.]

[Figure: Mizan Worker architecture; components: Vertex Compute(), BSP Graph Processor, Load Balancer: Migration Planner, Communicator - DHT, Storage Manager, IO, HDFS/Local Disks.]

Mizan's Migration Planning Steps

1. Identify the source of workload imbalance across workers
  • Remote outgoing messages
  • All incoming messages
  • Response time

[Figure: vertices V1 to V6 on Worker 1 and Worker 2, annotated with the monitored statistics: Remote Incoming Messages, Local Incoming Messages, Remote Outgoing Messages, Vertex Response Time.]


2. Select the migration objective through a statistical analysis
  • Optimize for outgoing messages, or
  • Optimize for incoming messages, or
  • Optimize for response time



3. Pair over-utilized workers with under-utilized ones

[Figure: workers ordered in slots 0 to 8: W7, W2, W1, W5, W8, W4, W0, W6, W3; W9 shown separately.]



4. Select vertices to migrate
  • Select the least number of vertices that has the highest impact
  • Vertex ownership: distributed hash table (DHT)
  • Delayed migration: reduce migration cost
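Step 3 of the planning process can be sketched as sorting workers by the chosen statistic and matching the two ends of the list. The sketch assumes a single numeric load per worker and treats "above the mean" and "below the mean" as over- and under-utilized; both the threshold and the function name are assumptions made for illustration, not Mizan's actual planner.

```python
def pair_workers(load):
    """Pair the i-th most loaded worker with the i-th least loaded one."""
    ordered = sorted(load, key=load.get, reverse=True)   # most loaded first
    mean = sum(load.values()) / len(load)
    pairs = []
    i, j = 0, len(ordered) - 1
    # Stop as soon as either end of the list is no longer on its side of the mean.
    while i < j and load[ordered[i]] > mean > load[ordered[j]]:
        pairs.append((ordered[i], ordered[j]))
        i, j = i + 1, j - 1
    return pairs

# Hypothetical per-worker statistic, e.g. outgoing messages in the last superstep.
LOAD = {"W0": 10, "W1": 50, "W2": 30, "W3": 20}
pairs = pair_workers(LOAD)
```

Each pair names a source worker and a destination worker; vertices would then be selected on the source side and migrated during the barrier.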

Performance of Mizan on PageRank

[Figure: bar chart of Runtime (Min), 0 to 40, comparing Static, WS and Mizan under Metis, Range and Hash partitioning.]

Performance of Mizan with Metis

[Figure: bar chart of Runtime (Min), 0 to 300, for the Advertisement and DMST algorithms; series: Static, Work Stealing, Mizan.]

Mizan – a General Graph Processing System

• A Pregel-clone
  • Supports very large graphs
  • Runs on very large clusters
• Dynamic fine-grained vertex migrations to balance computation and communication
• Optimized for predictable and non-predictable graph algorithms and structures

[Figure: software stack; labels: BigDansing, Apache Spark, GraphX, Mizan, Giraph, HDFS.]

* Zuhair Khayyat, et al., “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing”, in EuroSys 2013

Summary

• A general system for big data cleansing
  • Performance up to 2 orders of magnitude faster
  • SIGMOD 2015
• A novel algorithm for fast inequality joins
  • Performance at least 2 orders of magnitude faster
  • PVLDB 2015 & VLDBJ 2017
• A general system for distributed graph processing
  • Performance improvements up to 84%
  • EuroSys 2013

Publications

• Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, “Fast and Scalable Inequality Joins”, The VLDB Journal 2017, special issue: Best Papers of VLDB 2015.
• Divy Agrawal, Lamine Ba, Laure Berti-Equille, Sanjay Chawla, Ahmed Elmagarmid, Hossam Hammady, Yasser Idris, Zoi Kaoudi, Zuhair Khayyat, Sebastian Kruse, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Mohammed J. Zaki, “Rheem: Enabling Multi-Platform Task Execution”, in SIGMOD 2016.
• Zuhair Khayyat, William Lucia, Meghna Singh, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Panos Kalnis, “Lightning Fast and Space Efficient Inequality Joins”, in PVLDB 2015.
• Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin, “BigDansing: A System for Big Data Cleansing”, in SIGMOD 2015.
• Zuhair Khayyat, Karim Awara, Amani Alonazi, Hani Jamjoom, Dan Williams, Panos Kalnis, “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing”, in EuroSys 2013.
