Learning-based Data Cleaning

Christian Stade-Schuldt
Freie Universität Berlin

Thesis Talk, 16.12.2009

Outline

Motivation

Background
  The Data Cleaning Process
  Review of Machine Learning Techniques

Data-Cleaning Workflow

Results
  Clustering
  Classification

Summary

FU Berlin, Diploma Thesis, 16.12.2009

Why Data Cleaning?

- Many sources lead to different formats and standards.
- Migration becomes a costly issue.
- Built-in database techniques are not capable of dealing with dirty data.

Benefits of Data Cleaning

- Less time for data maintenance ⇒ more time for key job functions
- Removal of data inconsistencies
- More complete and accurate data sources
- Identify organizational, process and data issues ⇒ enforce standards

The Data Cleaning Process

Record Matching and Record Merging

- Simple: Record Matching based on a key or a set of rules
- Difficult: Record Matching without a key
- Database operations are primarily restricted to joins on fields and simple pattern matching.
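
The contrast between the two cases can be sketched in a few lines of Python. This is only an illustration, not the method used in the thesis: the function names, the sample records, and the 0.8 threshold are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for whatever similarity measure is actually chosen.

```python
from difflib import SequenceMatcher


def match_by_key(left, right, key):
    """Exact matching on a shared key, comparable to a database join."""
    index = {rec[key]: rec for rec in right}
    return [(rec, index[rec[key]]) for rec in left if rec[key] in index]


def match_without_key(left, right, field, threshold=0.8):
    """Approximate matching: pair records whose field values are similar.

    The threshold is an illustrative assumption, not a recommended value.
    """
    pairs = []
    for a in left:
        for b in right:
            score = SequenceMatcher(
                None, a[field].lower(), b[field].lower()
            ).ratio()
            if score >= threshold:
                pairs.append((a, b, score))
    return pairs


# Hypothetical sample data: ids disagree for one person, names contain a typo.
customers = [{"id": 1, "name": "Jon Smith"}, {"id": 2, "name": "Maria Garcia"}]
orders = [{"id": 1, "name": "John Smith"}, {"id": 3, "name": "Maria Garcia"}]

keyed = match_by_key(customers, orders, "id")
fuzzy = match_without_key(customers, orders, "name")
```

The key-based join misses the second person because the ids differ, while the approximate comparison finds both pairs at the cost of comparing every record with every other record.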

Canopy Clustering

Canopy Clustering allows efficient clustering of data sources which
- are large
- have records with a lot of attributes
- result in a lot of clusters

Canopy Clustering

Idea: Apply a cheap distance measure to cluster the data into overlapping canopies.
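
A minimal sketch of this idea, assuming the common two-threshold formulation of canopy clustering (a loose and a tight threshold). The function name, the parameter names, the sample points, and the threshold values are hypothetical; the distance function is passed in, so any cheap measure can be used.

```python
def canopy_clustering(points, distance, t_loose, t_tight):
    """Group points into possibly overlapping canopies.

    Every point within t_loose of a center joins that center's canopy.
    Every point within t_tight of a center is removed from the pool of
    future centers. Returns a list of canopies as lists of point indices.
    """
    if t_tight > t_loose:
        raise ValueError("t_tight must not exceed t_loose")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]
        # Membership is checked against all points, so canopies may overlap.
        members = [
            i for i in range(len(points))
            if distance(points[center], points[i]) <= t_loose
        ]
        canopies.append(members)
        # The center itself is always dropped, which guarantees termination.
        remaining = [
            i for i in remaining
            if i != center and distance(points[center], points[i]) > t_tight
        ]
    return canopies


# Hypothetical one-dimensional data with absolute difference as cheap distance.
points = [1.0, 1.4, 2.2, 8.0, 8.4, 9.3]
canopies = canopy_clustering(
    points, lambda a, b: abs(a - b), t_loose=1.0, t_tight=0.5
)
```

In this example the points at indices 1 and 4 each fall into two canopies. A more expensive measure would then only need to be applied to pairs that share a canopy.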

Canopy Clustering Distance Measure

- Use reverse indexing as a rough clustering constraint.
- Jaccard similarity coefficient: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$
- The ratio of the number of words two records share to the total number of distinct words in both records determines how similar the records are.
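
The measure above can be sketched as follows. The sketch assumes that "reverse indexing" refers to an inverted index from words to the records that contain them, and that records are tokenized by lowercasing and splitting on whitespace. All function names and sample records are hypothetical.

```python
from collections import defaultdict


def tokens(record):
    """Split a record into a set of lowercase words (simplistic tokenizer)."""
    return set(record.lower().split())


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two word sets."""
    union = a | b
    # Two empty sets are treated as dissimilar here; this is a convention.
    return len(a & b) / len(union) if union else 0.0


def build_inverted_index(records):
    """Map every word to the ids of the records that contain it."""
    index = defaultdict(set)
    for rec_id, record in enumerate(records):
        for word in tokens(record):
            index[word].add(rec_id)
    return index


def candidates(rec_id, records, index):
    """Ids of all other records that share at least one word with rec_id."""
    found = set()
    for word in tokens(records[rec_id]):
        found |= index[word]
    found.discard(rec_id)
    return found


records = [
    "Freie Universität Berlin",
    "Freie Universitaet Berlin",
    "Technische Universität München",
]
index = build_inverted_index(records)
similar = {
    other: jaccard(tokens(records[0]), tokens(records[other]))
    for other in candidates(0, records, index)
}
```

The index restricts comparisons to records that share at least one word. Here the first record shares two of four distinct words with the second record and one of five with the third.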

Support Vector Machines

- Maximize the margin $m = \frac{2}{\|w\|}$
- Kernel trick
- Black box technique
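
As a small numeric illustration of the margin formula, the sketch below assumes a linear decision function f(x) = w·x + b whose margin boundaries lie at f(x) = +1 and f(x) = -1. The weight vector, bias, and sample points are hypothetical values chosen for easy arithmetic.

```python
import math


def norm(w):
    """Euclidean length of the weight vector."""
    return math.sqrt(sum(c * c for c in w))


def margin(w):
    """Width of the margin, m = 2 / ||w||."""
    length = norm(w)
    if length == 0:
        raise ValueError("w must not be the zero vector")
    return 2.0 / length


def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm(w)


# Hypothetical hyperplane: ||w|| = 5, so the margin is 2 / 5.
w, b = (3.0, 4.0), -5.0
m = margin(w)

# One point on each margin boundary: f(x) = +1 and f(x) = -1.
positive = signed_distance(w, b, (2.0, 0.0))
negative = signed_distance(w, b, (0.0, 1.0))
```

The two boundary points lie at distances +1/||w|| and -1/||w|| from the hyperplane, so the gap between them equals the margin. A smaller ||w|| therefore means a wider margin.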

Model Generator

- Strings
- Abbreviation Detection
- Normalized Edit Distance
- Learning String Edit Distance
- Rule Engine
- Numbers
- Dates
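
One of the string features listed above, the normalized edit distance, can be sketched as follows. The slide does not state which normalization is used; this sketch assumes Levenshtein distance divided by the length of the longer string, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or keep
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length.

    Assumption: other normalizations exist; this is one common choice.
    """
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


distance = normalized_edit_distance("street", "str.")
```

Scaling to the range 0 to 1 makes values comparable across fields of different lengths, which is convenient when the distance serves as a feature for a classifier.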

Clustering the data

Classification and Backflow

Clustering Results

- Find "best" features and parameters
- Trade-off between quality and size of the search space

Classification Results for Dataset I
How does the number of training samples affect the results?

Classification Results for Dataset II
How does the computation of features affect the results?

Summary

- Data Cleaning using Clustering and Classification
- Business Value: Reduced Manpower + Improved Data Quality
- Future Work
  - Improved features
  - Automatic selection of parameters
  - Scalability
