Beyond Kaggle: Solving Data Science Challenges at Scale
TRANSCRIPT
1
DRAFT
Think Big, Start Smart, Scale Fast
Dato Conference
Data Matching and Deduplication using Dato Toolkits
July 21st, 2015
Guillermo Breto Rangel, PhD
2
Entity Resolution: Multiple Definitions
Entity Resolution (ER)
Extract, match and disambiguate entity records in data.
3
Entity Resolution: Real World Entity
Matching real world entities with profiles, mentions...
You: Facebook account(s), LinkedIn profile(s), Tweets, Google Searches
Many records → ER → Unique identities
4
Entity Resolution: Use Cases
◆ Network Analysis
◆ Vocabulary Normalization: different organizations report different names for the same entities
◆ Network Security: Finding user actions/intents
◆ Data Cleaning: removing duplicated records
◆ Metadata enrichment: matched records append their metadata to the entity
5
Entity Resolution: Challenges
◆ Missing Values
◆ Data entry errors
◆ Abbreviations and formatting
◆ Data volume
◆ Variety of raw data sources
  ○ free text, semi-structured, streaming
◆ Data integration from multiple sources
◆ Preprocessing
◆ Normalization
◆ Choosing similarity metrics
6
Dataset: DBpedia/Amazon-Google Products
Putting a schema to Wikipedia
Crowd-sourced community project
Queries against Wikipedia data
Match data sets on the Web to Wikipedia data
A set of triples → <dbpedia:Luc_Besson> <dbpedia-owl:spouse> <dbpedia:Milla_Jovovich>
Matching Amazon Products and Google Products
Deich Library and
7
Preprocessing: Steps
1) Extract tokens
2) Clean triplets
3) Pivot table
4) Select relevant features
5) Normalization
6) Choosing similarity metrics
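The first five steps above can be sketched on a handful of DBpedia-style triples. This is a minimal illustration in plain Python, not the Dato toolkit API. The first triple is the one shown in the deck; the `ex:` records, the helper names and the chosen feature list are assumptions made up for the example.

```python
import re
from collections import defaultdict

# Toy input: the first triple appears in the deck, the others are
# made-up illustrative records (hypothetical "ex:" names).
RAW_TRIPLES = [
    "<dbpedia:Luc_Besson> <dbpedia-owl:spouse> <dbpedia:Milla_Jovovich>",
    "<ex:Acme_Corp> <ex:name> <ex:ACME  Corporation>",
    "<ex:Acme_Corp> <ex:city> <ex:San_Jose>",
    "not a triple",
]

TOKEN = re.compile(r"<([^<>]+)>")

def extract_tokens(line):
    """Step 1: pull the <...> tokens out of one raw line."""
    return TOKEN.findall(line)

def clean_triples(lines):
    """Step 2: keep only well-formed (subject, predicate, object) triples."""
    triples = []
    for line in lines:
        tokens = extract_tokens(line)
        if len(tokens) == 3:
            triples.append(tuple(tokens))
    return triples

def strip_prefix(token):
    """Drop the namespace prefix, e.g. 'dbpedia:Luc_Besson' -> 'Luc_Besson'."""
    return token.split(":", 1)[-1]

def normalize(value):
    """Step 5: lowercase, turn underscores into spaces, collapse whitespace."""
    return " ".join(value.replace("_", " ").lower().split())

def pivot(triples):
    """Step 3: one row per subject, one column per predicate."""
    table = defaultdict(dict)
    for subj, pred, obj in triples:
        table[strip_prefix(subj)][strip_prefix(pred)] = normalize(strip_prefix(obj))
    return dict(table)

def select_features(table, features):
    """Step 4: keep only the chosen columns; missing values become None."""
    return {s: {f: row.get(f) for f in features} for s, row in table.items()}

table = pivot(clean_triples(RAW_TRIPLES))
records = select_features(table, ["name", "city", "spouse"])
```

Each subject ends up as one row with a fixed set of columns, which is the shape a distance function needs. Missing values are kept explicit as `None` so they can be handled deliberately later.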
8
Algorithm: Nearest Neighbors
● The entity resolution problem is approached as a network problem
  ○ Nodes: entity records
  ○ Edges: similarity measures
● Define a distance between entities to find the nearest neighbors. Composite distances can be built from Euclidean, squared Euclidean, Levenshtein, Jaccard, Manhattan, cosine and dot product distances
● Compute the distance between all entities and find the nearest neighbors
● Duplicates are the connected components of the graph; each component is labeled as one entity
● Some parameters to keep in mind:
  ○ Grouping_features
  ○ k (number of neighbors to compare)
  ○ Radius (the distance threshold)
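The approach above can be sketched in plain Python. This is a brute-force illustration under stated assumptions, not the Dato toolkit implementation: the record fields (`name`, `description`), the equal weights and the default `k` and `radius` values are all hypothetical, and the Grouping_features parameter is not modeled.

```python
from itertools import combinations

def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| computed on word sets."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def composite_distance(r1, r2, w_name=0.5, w_desc=0.5):
    """Weighted sum of a length-normalized Levenshtein distance on 'name'
    and a Jaccard distance on 'description' (weights are assumptions)."""
    longest = max(len(r1["name"]), len(r2["name"]), 1)
    d_name = levenshtein(r1["name"], r2["name"]) / longest
    d_desc = jaccard_distance(r1["description"], r2["description"])
    return w_name * d_name + w_desc * d_desc

def find_duplicates(records, k=2, radius=0.3):
    """Link each record to its k nearest neighbors that lie within `radius`,
    then return the connected components as lists of record indices."""
    n = len(records)
    dist = {}
    for i, j in combinations(range(n), 2):
        dist[i, j] = dist[j, i] = composite_distance(records[i], records[j])

    parent = list(range(n))  # union-find forest over record indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        neighbors = sorted((dist[i, j], j) for j in range(n) if j != i)[:k]
        for d, j in neighbors:
            if d <= radius:
                parent[find(i)] = find(j)  # add an edge i-j

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Made-up product records for illustration only.
records = [
    {"name": "acme laser printer", "description": "fast office laser printer"},
    {"name": "acme laserprinter", "description": "fast laser printer office"},
    {"name": "zeta photo editor", "description": "image editing software"},
]
groups = find_duplicates(records)  # [[0, 1], [2]]
```

Computing all pairwise distances is quadratic in the number of records, so this only works for small inputs. At scale, some form of blocking or an indexed nearest-neighbor search would be needed.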
9
Results:
The benchmark results can be found at:
https://github.com/cubreto/dataDeduplication
10
Lessons Learned:
◆ Most of the time is spent on preprocessing
◆ Hard to define the distance threshold
◆ Weighting the composite distance
◆ Data volume
◆ Dealing with missing values
◆ Tuning the parameters
◆ Finding exact matches
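One way to see why the distance threshold is hard to pick is to sweep it and count the resulting groups. The sketch below assumes made-up pairwise distances between four hypothetical records. The helper name `count_groups` and the radius values are illustrative, not taken from the benchmark.

```python
# Made-up pairwise distances between four hypothetical records (0..3).
DIST = {(0, 1): 0.05, (0, 2): 0.35, (0, 3): 0.90,
        (1, 2): 0.40, (1, 3): 0.85, (2, 3): 0.80}

def count_groups(dist, n, radius):
    """Number of connected components when every pair with
    distance <= radius is linked."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), d in dist.items():
        if d <= radius:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

# Group count for a few candidate thresholds.
sweep = {r: count_groups(DIST, 4, r) for r in (0.01, 0.1, 0.5, 1.0)}
# {0.01: 4, 0.1: 3, 0.5: 2, 1.0: 1}
```

A radius that is too small leaves every record in its own group, and one that is too large merges everything into a single entity. The useful range in between depends on the data and on how the composite distance is weighted.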
11
Some Resources/Bibliography
◆ Ricardo Vasquez Sierra, PhD: Senior Data Scientist from Ooyala
◆ Kevin Glynn, MS: Data Scientist and Khan Academy Instructor
◆ Vince Gonzalez: MapR Software Engineer
◆ Alexey Svyatkovskiy, PhD: Big Data Scientist, Princeton University
◆ Ashwin Machanavajjhala, PhD: Professor of Computer Science, Duke University
◆ Lise Getoor, PhD: Professor of Computer Science, UC Santa Cruz
  ○ KDD Tutorial on Entity Resolution in Big Data
  ○ Deduplication and Group Detection using Links, Indrajit Bhattacharya and Lise Getoor, The 10th ACM SIGKDD Workshop on Link Analysis and Group Detection (LinkKDD-04).
  ○ Collective Entity Resolution in Relational Data, Indrajit Bhattacharya and Lise Getoor, ACM Transactions on Knowledge Discovery from Data (ACM-TKDD), 2007
◆ The Dato Team
◆ My colleagues at Think Big