Learning-based Data Cleaning Christian Stade-Schuldt Freie Universität Berlin Thesis Talk, 16.12.2009

Upload: christian-stade-schuldt

Post on 10-Aug-2015


Page 1: Learning-based Data Cleaning

Learning-based Data Cleaning

Christian Stade-Schuldt
Freie Universität Berlin

Thesis Talk, 16.12.2009


Outline

Motivation

Background
  The Data Cleaning Process
  Review of Machine Learning Techniques

Data-Cleaning Workflow

Results
  Clustering
  Classification

Summary


FU Berlin, Diploma Thesis, 16.12.2009 2


Why Data Cleaning?

• Many sources lead to different formats and standards.
• Migration becomes a costly issue.
• Built-in database techniques are not capable of dealing with dirty data.


Benefits of Data Cleaning

• Less time for data maintenance ⇒ more time for key job functions
• Removal of data inconsistencies
• More complete and accurate data sources
• Identify organizational, process and data issues ⇒ enforce standards


The Data Cleaning Process


Record Matching and Record Merging

• Simple: Record Matching based on a key or a set of rules
• Difficult: Record Matching without a key
• Database operations are primarily restricted to joins on fields and simple pattern matching.
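
To make the contrast between the two cases concrete, here is a minimal Python sketch. It is not taken from the thesis: the record layout, the function names and the 0.8 threshold are assumptions for illustration, and `difflib.SequenceMatcher` merely stands in for whatever similarity measure is actually used.

```python
from difflib import SequenceMatcher


def match_by_key(left, right, key):
    """Exact matching: pair records that share the same key value (like a join)."""
    index = {rec[key]: rec for rec in right}
    return [(rec, index[rec[key]]) for rec in left if rec[key] in index]


def match_without_key(left, right, field, threshold=0.8):
    """Approximate matching: pair records whose field values are similar enough."""
    pairs = []
    for a in left:
        for b in right:
            score = SequenceMatcher(None, a[field].lower(), b[field].lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs


# Hypothetical records from two sources that do not share a key.
crm = [{"id": 1, "name": "Freie Universität Berlin"}]
erp = [{"id": 7, "name": "Freie Universitaet Berlin"}]

print(match_by_key(crm, erp, "id"))         # the ids differ, so nothing is paired
print(match_without_key(crm, erp, "name"))  # the similar names are paired
```

The second function compares every record with every other record, which is the cost that a cheap pre-clustering step is meant to reduce.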


Canopy Clustering

Canopy Clustering allows efficient clustering of data sources which
• are large
• have records with a lot of attributes
• result in a lot of clusters


Canopy Clustering

Idea: Apply a cheap distance measure to cluster the data into overlapping canopies.
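
The idea can be sketched in a few lines of Python. This is a simplified illustration, not the thesis implementation: the two thresholds `loose` and `tight`, their values and the sample records are assumptions, and a word-level Jaccard distance is used as the cheap measure.

```python
def jaccard_distance(a, b):
    """Cheap distance on word sets: 1 - |A ∩ B| / |A ∪ B|."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def canopies(records, loose=0.8, tight=0.6):
    """Group records into overlapping canopies (loose and tight are assumed thresholds)."""
    tokens = {i: set(r.lower().split()) for i, r in enumerate(records)}
    candidates = list(tokens)  # records that may still start or join a canopy
    result = []
    while candidates:
        center = candidates[0]
        # Everything within the loose threshold joins the canopy of this center.
        canopy = [i for i in candidates
                  if jaccard_distance(tokens[center], tokens[i]) <= loose]
        result.append(canopy)
        # Records within the tight threshold are settled and leave the candidate list.
        candidates = [i for i in candidates
                      if i != center
                      and jaccard_distance(tokens[center], tokens[i]) > tight]
    return result


# Hypothetical sample records.
records = [
    "Freie Universität Berlin",
    "Freie Universitaet Berlin",
    "FU Berlin",
    "TU München",
    "Technische Hochschule München",
]
print(canopies(records))
```

A record that lies between the two thresholds stays in the candidate list and can show up in more than one canopy, which is where the overlap comes from. Expensive comparisons are then only needed inside each canopy.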


Canopy Clustering Distance Measure

• Use reverse indexing as a rough clustering constraint.
• Jaccard similarity coefficient: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$
• The ratio between the number of word matches and the number of total words between two records determines how similar the records are.
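
A small Python sketch of both ingredients follows. It assumes whitespace tokenization and lower-casing, which the slide does not specify; the function names and sample records are hypothetical.

```python
from collections import defaultdict


def words(record):
    """Word set of a record (assumed tokenization: lower-case, split on whitespace)."""
    return set(record.lower().split())


def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B| on the word sets of two records."""
    A, B = words(a), words(b)
    union = A | B
    return len(A & B) / len(union) if union else 1.0


def build_reverse_index(records):
    """Map each word to the positions of the records that contain it."""
    index = defaultdict(set)
    for pos, record in enumerate(records):
        for word in words(record):
            index[word].add(pos)
    return index


def candidates(pos, records, index):
    """Records sharing at least one word with records[pos]; all others have J = 0."""
    found = set()
    for word in words(records[pos]):
        found |= index[word]
    found.discard(pos)
    return found


records = ["Freie Universität Berlin", "FU Berlin", "TU München"]
index = build_reverse_index(records)
print(jaccard(records[0], records[1]))  # one shared word out of four distinct words
print(candidates(0, records, index))    # only records that share a word with record 0
```

The reverse index acts as the rough constraint: two records without a common word have similarity 0 and never need to be compared at all.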


Support Vector Machines

• Maximize the margin $m = \frac{2}{\|w\|}$
• Kernel trick
• Black box technique
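
The margin formula can be checked numerically. The sketch below assumes the usual linear formulation, in which the margin is bounded by the hyperplanes w·x + b = +1 and w·x + b = -1; the weights and points are hypothetical values picked so that ||w|| = 5.

```python
import math


def norm(w):
    """Euclidean length of the weight vector."""
    return math.sqrt(sum(wi * wi for wi in w))


def margin(w):
    """Width of the margin between w·x + b = +1 and w·x + b = -1."""
    return 2.0 / norm(w)


def signed_distance(w, b, x):
    """Signed distance of point x from the separating hyperplane w·x + b = 0."""
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm(w)


w, b = (3.0, 4.0), -1.0  # hypothetical weights, chosen so that ||w|| = 5
on_plus = (0.0, 0.5)     # satisfies w·x + b = +1
on_minus = (0.0, 0.0)    # satisfies w·x + b = -1

gap = signed_distance(w, b, on_plus) - signed_distance(w, b, on_minus)
print(margin(w), gap)    # both values are the margin width
```

Since the margin shrinks as ||w|| grows, maximizing the margin amounts to minimizing ||w|| while keeping every training sample on the correct side.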


Data-Cleaning Workflow


Model Generator

• Strings
• Abbreviation Detection
• Normalized Edit Distance
• Learning String Edit Distance
• Rule Engine
• Numbers
• Dates
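
As an illustration of one of the string features, here is a minimal Python sketch of a normalized edit distance. The slide does not say which normalization is used; dividing the Levenshtein distance by the length of the longer string is an assumption, as are the function names.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # delete ca
                               current[j - 1] + 1,             # insert cb
                               previous[j - 1] + (ca != cb)))  # substitute or keep
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(normalized_edit_distance("Strasse", "Straße"))  # small: spelling variants
print(normalized_edit_distance("Berlin", "München"))  # large: unrelated strings
```

Scaling to [0, 1] makes values comparable across short and long fields, which matters when several such features are fed to a classifier together.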


Clustering the data


Classification and Backflow


Clustering Results

• Find "best" features and parameters
• Trade-off between quality and size of the search space


Classification Results for Dataset I

How does the number of training samples affect the results?


Classification Results for Dataset II

How does the computation of features affect the results?


Summary

• Data Cleaning using Clustering and Classification
• Business Value: Reduced Manpower + Improved Data Quality
• Future Work
  • Improved features
  • Automatic selection of parameters
  • Scalability
