tamr // data driven nyc // september 2014
DESCRIPTION
Tamr Co-Founder Ihab Ilyas presented at September 2014's edition of Data Driven NYC. Tamr is a data connection platform.
TRANSCRIPT
Moving Data Curation from Theory to Practice
Ihab Ilyas
University of Waterloo
Data Curation: Many Definitions and One Goal
Extract Value from Data
Many Technical Challenges
Automatic Schema Mapping
Example: a global equipment manufacturer with thousands of products across
hundreds of databases from multiple suppliers talking about the same part numbers
<Part Number>
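A minimal sketch of name-based attribute matching, to make the schema mapping problem concrete. It assumes string similarity on column names alone is enough to propose candidates; the column names, the `normalize` and `suggest_mapping` helpers, and the 0.6 threshold are all hypothetical and are not taken from the talk.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase a column name and drop non-alphanumeric characters."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def suggest_mapping(source_cols, target_cols, threshold=0.6):
    """Suggest the most similar target column for each source column.

    Returns None for a source column when no target reaches the
    (assumed) similarity threshold.
    """
    mapping = {}
    for src in source_cols:
        scored = [
            (SequenceMatcher(None, normalize(src), normalize(tgt)).ratio(), tgt)
            for tgt in target_cols
        ]
        best_score, best_tgt = max(scored)
        mapping[src] = best_tgt if best_score >= threshold else None
    return mapping

# Hypothetical supplier columns and a hypothetical unified schema.
supplier_cols = ["Part Number", "PartNo", "Supplier", "Color"]
unified_cols = ["part_number", "supplier_name", "unit_price"]
mapping = suggest_mapping(supplier_cols, unified_cols)
```

Columns left at `None` (here `Color`) are the ones a name-only matcher cannot place, which is where a person who knows the source would have to weigh in.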
Many Technical Challenges
Record Linkage and Deduplication
Example: Thomson Reuters spent 6 months on a single deduplication project
of a subset of their data sources
[Figure: Record 1, Record 2, Record 3, and Record 4 combined into a Unified Record]
Many Technical Challenges
Missing Values
Example: Most real data collected from sensors, surveys, and agents has a high
percentage of N/A values, nulls, or special placeholder values (e.g., 99999).
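A minimal sketch of treating N/A strings, nulls, and special values as one kind of "missing". The `SENTINELS` set, the helper names, and the sensor readings are assumptions for illustration; in practice the placeholder list would be chosen per column, since a value like 99999 can be legitimate in some fields.

```python
# Assumed placeholder spellings; a real list would be column-specific.
SENTINELS = {"", "n/a", "na", "null", "none", "99999"}

def normalize_missing(value):
    """Return None when a value looks like a missing-value placeholder."""
    if value is None:
        return None
    text = str(value).strip().lower()
    return None if text in SENTINELS else value

def missing_rate(rows, column):
    """Fraction of rows whose value in `column` normalizes to None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if normalize_missing(row.get(column)) is None)
    return missing / len(rows)

# Hypothetical sensor readings.
readings = [
    {"sensor": "s1", "temp": "21.5"},
    {"sensor": "s2", "temp": "99999"},
    {"sensor": "s3", "temp": "N/A"},
    {"sensor": "s4", "temp": None},
]
rate = missing_rate(readings, "temp")
```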
ID  Name   ZIP    City           State  Income
1   Green  60610  Chicago        IL     30k
2   Green  60611  Chicago        IL     32k
3   Peter         New Yrk        NY     40k
4   John   11507  New York       NY     40k
5   Gree   90057  Los Angeles    CA     55k
6   Chuck  90057  San Francisco  CA     30k
Common Data Quality Issues
Duplicates
Syntactic Error
Integrity Constraint Violation
Missing Value
1 Green 60610 Chicago IL 31k
11507 New York
Los Angeles
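Two of the issue types above can be checked mechanically. The sketch below uses the six sample rows from the slide and assumes a rule that ZIP determines City; that rule, the reading of row 3's blank ZIP as a missing value, and the `fd_violations` helper are assumptions for illustration.

```python
from collections import defaultdict

# (ID, Name, ZIP, City, State, Income); row 3's ZIP is read as missing.
ROWS = [
    (1, "Green", "60610", "Chicago", "IL", "30k"),
    (2, "Green", "60611", "Chicago", "IL", "32k"),
    (3, "Peter", None, "New Yrk", "NY", "40k"),
    (4, "John", "11507", "New York", "NY", "40k"),
    (5, "Gree", "90057", "Los Angeles", "CA", "55k"),
    (6, "Chuck", "90057", "San Francisco", "CA", "30k"),
]

def fd_violations(rows, lhs_idx, rhs_idx):
    """Report left-hand values that map to more than one right-hand value."""
    seen = defaultdict(set)
    for row in rows:
        if row[lhs_idx] is not None:
            seen[row[lhs_idx]].add(row[rhs_idx])
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}

# Assumed integrity constraint: ZIP (index 2) determines City (index 3).
violations = fd_violations(ROWS, 2, 3)
missing_zip_ids = [row[0] for row in ROWS if row[2] is None]
```

The check flags that something is wrong with ZIP 90057 but not which row is wrong; deciding between Los Angeles and San Francisco still needs outside knowledge.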
Are We Missing the Real Challenges?
"Poor data quality is the norm rather than the exception, but most organizations are in a state of denial about this issue." - Gartner Group
Realities of Data Curation Efforts
Data is owned by people and is not an orphan
Result: Fully automated cleaning will probably never be adopted
in an enterprise setting
[Figure: constant interaction among Data Stewards, IT, and Data Experts]
Realities of Data Curation Efforts
Scale renders most solutions un-deployable
Result: Need to rethink all cleaning algorithms including record
linkage to work at scale and avoid quadratic complexity. De-duping
one million records naïvely can take weeks (even on a big machine)
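The quadratic cost is easy to quantify: comparing every pair among n records takes n(n-1)/2 comparisons, which for one million records is 499,999,500,000. The sketch below contrasts that with simple key-based blocking, a standard way to cut the number of comparisons. The blocking key (first letter of the name) and the helper names are assumptions for illustration, and this is plain blocking rather than the fuzzy blocking described later in the talk.

```python
from collections import defaultdict
from itertools import combinations

def naive_pair_count(n):
    """Number of comparisons when every record is compared to every other."""
    return n * (n - 1) // 2

def blocked_pairs(records, key):
    """Yield only pairs of records that share the same blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

names = ["Green", "Green", "Peter", "Peter", "Gree", "Chuck"]
candidates = list(blocked_pairs(names, key=lambda name: name[0].lower()))

all_pairs = naive_pair_count(len(names))      # every pair among 6 records
million_pairs = naive_pair_count(1_000_000)   # every pair among 1M records
```

With this key, 4 candidate pairs are compared instead of 15. The trade-off is recall: two duplicates that land in different blocks are never compared, so the choice of key matters.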
Realities of Data Curation Efforts
Data Variety is even worse
Result: Curation requires its own stack including transformations and adaptors
Realities of Data Curation Efforts
Iterative by nature -- not by design
Result: Need to be incremental and agile, with low startup overhead. A curation
solution should be part of the data production line
Data stream
Pragmatic ≠ Unprincipled
Leverage All Data
• Data ownership
• Scale
• Variety
• Incremental cleaning
[CIDR 2013]
Structured and Semi-structured Data Sources
Collaborative Curation
Data Experts (Source owners)
Data Stewards and Curators
Data Inventory
APIs
Systems
Tools
Data Scientists
Advanced Algorithms & Machine Learning
Expert Input
Integrated Data & Metadata
Identify sources, understand relationships and curate the massive variety of siloed data
Expert Directory
Tamr: Machine Learning with Human Insight
Data Ownership
Non-programmatic Interfaces
Scale & Variety
Incremental
Example Tamr Functionality: Entity Resolution
Relation with duplicates
ID  name   ZIP    Income
P1  Green  51519  30k
P2  Green  51518  32k
P3  Peter  30528  40k
P4  Peter  30528  40k
P5  Gree   51519  55k
P6  Chuck  51519  30k
Compute Pair-wise Similarity
[Figure: similarity graph over records P1 to P6; edge weights shown: 0.3, 0.5, 0.9, 1.0]
Cluster Similar Records
[Figure: records P1 to P6 grouped into clusters]
Merge Clusters
[Figure: merged clusters C1, C2, C3]
Clean Relation
ID  name   ZIP    Income
C1  Green  51519  39k
C2  Peter  30528  40k
C3  Chuck  51519  30k
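The three steps on this slide (pair-wise similarity, clustering, merging) can be sketched end to end on the six sample records. This is an illustration under assumptions, not Tamr's implementation: similarity is computed on the name alone, clusters are connected components above a hypothetical 0.8 threshold, and the merge rule (majority vote for name and ZIP, mean for income) was chosen because it reproduces the clean relation shown on the slide.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

# (name, ZIP, income in thousands) from the relation with duplicates.
RECORDS = {
    "P1": ("Green", "51519", 30),
    "P2": ("Green", "51518", 32),
    "P3": ("Peter", "30528", 40),
    "P4": ("Peter", "30528", 40),
    "P5": ("Gree", "51519", 55),
    "P6": ("Chuck", "51519", 30),
}

def similarity(a, b):
    """Assumed similarity: string similarity of the two names only."""
    return SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio()

def cluster(records, threshold=0.8):
    """Group records into connected components of the similarity graph."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if similarity(records[a], records[b]) >= threshold:
            parent[find(b)] = find(a)

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), []).append(rid)
    return list(groups.values())

def merge(records, ids):
    """Assumed merge rule: majority name and ZIP, mean income."""
    names = Counter(records[i][0] for i in ids)
    zips = Counter(records[i][1] for i in ids)
    income = round(sum(records[i][2] for i in ids) / len(ids))
    return (names.most_common(1)[0][0], zips.most_common(1)[0][0], income)

clusters = cluster(RECORDS)
clean = [merge(RECORDS, ids) for ids in clusters]
```

Note that the all-pairs loop in `cluster` is exactly the quadratic step that does not scale; it is acceptable here only because there are six records.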
Tamr Solution
• Object linkage model
– Treats schema mapping and record linkage as one process
– Feature extraction and evidence accumulation
• Novel fuzzy blocking
– A hierarchy of classifiers, from binning candidates in the same comparison clusters to aggressive linkage of duplicates
• Open-channel with humans in different capacities
– Experts to increase the training sets
– Stewards for verification and application of updates
– IT for modeling user enterprise rules
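One way to picture the open channel with humans is a router that auto-decides the pairs a model is confident about and sends the rest to an expert, whose answers could then be added to the training set. The sketch below is hypothetical throughout: the `route` function, the 0.3 and 0.9 thresholds, and the pair scores are invented for illustration and do not describe how Tamr assigns work.

```python
def route(pair_scores, low=0.3, high=0.9):
    """Split scored record pairs into automatic decisions and expert questions.

    Pairs scoring at or above `high` are accepted as matches, pairs at or
    below `low` are rejected, and everything in between is queued for a
    human expert. Both thresholds are assumed values.
    """
    decisions = {"match": [], "non_match": [], "ask_expert": []}
    for pair, score in pair_scores.items():
        if score >= high:
            decisions["match"].append(pair)
        elif score <= low:
            decisions["non_match"].append(pair)
        else:
            decisions["ask_expert"].append(pair)
    return decisions

# Illustrative scores only; not read off the slide's similarity graph.
scores = {
    ("A", "B"): 0.97,
    ("A", "C"): 0.55,
    ("C", "D"): 0.12,
    ("B", "C"): 0.70,
}
decisions = route(scores)
```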
Tamr Architecture
Suggestions to Build a Unified Schema
ML to Match Records Across Sources