Learning-based Data Cleaning
Christian Stade-Schuldt
Freie Universität Berlin
Thesis Talk, 16.12.2009
Outline
Motivation
Background
  The Data Cleaning Process
  Review of Machine Learning Techniques
Data-Cleaning Workflow
Results
  Clustering
  Classification
Summary
FU Berlin, Diploma Thesis, 16.12.2009 2
Why Data Cleaning?
• Many sources lead to different formats and standards.
• Migration becomes a costly issue.
• Built-in database techniques are not capable of dealing with dirty data.
Benefits of Data Cleaning
• Less time for data maintenance ⇒ more time for key job functions
• Removal of data inconsistencies
• More complete and accurate data sources
• Identify organizational, process and data issues ⇒ enforce standards
The Data Cleaning Process
Record Matching and Record Merging
• Simple: Record Matching based on a key or a set of rules
• Difficult: Record Matching without a key
• Database operations are primarily restricted to joins on fields and simple pattern matching.
Canopy Clustering
Canopy Clustering allows efficient clustering of data sources which
• are large
• have records with a lot of attributes
• result in a lot of clusters
Canopy Clustering
Idea: Apply a cheap distance measure to cluster the data into overlapping canopies.
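The idea can be sketched in a few lines of Python. This is a minimal illustration of the commonly described two-threshold canopy algorithm, not the implementation used in the thesis. The thresholds `t1` (loose) and `t2` (tight), the function name `canopy_clustering` and the 1-D toy data are assumptions made for the example.

```python
import random


def canopy_clustering(points, distance, t1, t2, seed=0):
    """Group points into overlapping canopies using a cheap distance.

    Assumed parameters: t1 is the loose threshold (canopy membership),
    t2 is the tight threshold (points this close to a center are no
    longer candidates for becoming a center). Requires t1 >= t2.
    """
    if t1 < t2:
        raise ValueError("t1 must be >= t2")
    rng = random.Random(seed)
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = rng.choice(remaining)
        # Every point within t1 of the center joins this canopy,
        # even if it already belongs to another one (overlap).
        members = [i for i in range(len(points))
                   if distance(points[center], points[i]) <= t1]
        canopies.append((center, members))
        # Points within t2 (and the center itself) stop being candidates.
        remaining = [i for i in remaining
                     if i != center
                     and distance(points[center], points[i]) > t2]
    return canopies


# Toy data: two well-separated groups on a line.
toy_points = [0.0, 0.5, 1.0, 5.0, 5.4]
toy_canopies = canopy_clustering(
    toy_points, lambda a, b: abs(a - b), t1=1.5, t2=0.6)
```

Only pairs that share a canopy need to be compared with the expensive measure later on, which is where the savings on large data sources come from.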
Canopy Clustering Distance Measure
• Use reverse indexing as a rough clustering constraint.
• Jaccard similarity coefficient: J(A,B) = |A∩B| / |A∪B|
• The ratio between the number of word matches and the number of total words between two records determines how similar the records are.
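The measure above can be illustrated with a short Python sketch. It assumes that records are plain strings tokenized by lowercasing and splitting on whitespace. The helper names (`tokens`, `jaccard`, `build_inverted_index`, `candidates`) and the sample records are hypothetical and not taken from the thesis.

```python
from collections import defaultdict


def tokens(record):
    """Assumed tokenization: lowercase, split on whitespace."""
    return set(record.lower().split())


def jaccard(a, b):
    """Jaccard similarity |A∩B| / |A∪B| of the word sets of two records."""
    a, b = tokens(a), tokens(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty records
    return len(a & b) / len(a | b)


def build_inverted_index(records):
    """Map each word to the ids of the records that contain it."""
    index = defaultdict(set)
    for rid, record in enumerate(records):
        for word in tokens(record):
            index[word].add(rid)
    return index


def candidates(rid, records, index):
    """Ids of all other records sharing at least one word with record rid."""
    found = set()
    for word in tokens(records[rid]):
        found |= index[word]
    found.discard(rid)
    return found


sample_records = [
    "Freie Universität Berlin",
    "FU Berlin",
    "Technische Universität München",
]
sample_index = build_inverted_index(sample_records)
```

The index restricts comparisons to records that share at least one word. Records without a common word have a Jaccard similarity of 0 and never need to be compared.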
Support Vector Machines
• Maximize the margin m = 2 / ||w||
• Kernel trick
• Black box technique
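As a small numeric check of the margin formula, the sketch below assumes the usual canonical form of a linear SVM, in which the two margin hyperplanes are w·x + b = +1 and w·x + b = -1. The function names and the example values of `w` and `b` are illustrative only.

```python
import math


def margin(w):
    """Width of the margin, m = 2 / ||w||, for a weight vector w."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0:
        raise ValueError("w must be non-zero")
    return 2.0 / norm


def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


# Example values (assumed): ||w|| = 5, so the margin is 2 / 5.
w_example, b_example = (3.0, 4.0), -5.0
on_plus_plane = (2.0, 0.0)    # 3*2 + 4*0 - 5 = +1
on_minus_plane = (0.0, 1.0)   # 3*0 + 4*1 - 5 = -1
gap = (signed_distance(w_example, b_example, on_plus_plane)
       - signed_distance(w_example, b_example, on_minus_plane))
```

Here `gap` equals `margin(w_example)`, which shows why a smaller ||w|| means a wider margin.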
Data-Cleaning Workflow
Model Generator
• Strings
• Abbreviation Detection
• Normalized Edit Distance
• Learning String Edit Distance
• Rule Engine
• Numbers
• Dates
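One of the string features listed above, the normalized edit distance, can be sketched as follows. The slide does not state which normalization the thesis uses. Dividing the Levenshtein distance by the length of the longer string is an assumption made for this example, as are the function names.

```python
def levenshtein(s, t):
    """Classic edit distance with unit costs, using two rows of the DP table."""
    if len(s) < len(t):
        s, t = t, s  # keep the inner row as short as possible
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_distance(s, t):
    """Edit distance scaled to [0, 1] by the longer string (assumed scheme)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 0.0
    return levenshtein(s, t) / longest
```

A value of 0 means the strings are identical and 1 means they share nothing position-wise, which makes the feature comparable across fields of different lengths.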
Clustering the data
Classification and Backflow
Clustering Results
• Find "best" features and parameters
• Trade-off between quality and size of the search space
Classification Results for Dataset I
How does the number of training samples affect the results?
Classification Results for Dataset II
How does the computation of features affect the results?
Summary
• Data Cleaning using Clustering and Classification
• Business Value: Reduced Manpower + Improved Data Quality
• Future Work
  • Improved features
  • Automatic selection of parameters
  • Scalability