tamr // data driven nyc // september 2014

Moving Data Curation from Theory to Practice

Ihab Ilyas

University of Waterloo

Data Curation: Many Definitions and One Goal

Extract Value from Data


Many Technical Challenges

Automatic Schema Mapping


Example: a global equipment manufacturer with thousands of products across hundreds of databases from multiple suppliers talking about the same part numbers

<Part Number>
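A minimal sketch of how attribute names from different supplier databases might be mapped onto one unified attribute such as Part Number. The column names, the unified attribute list, the string-similarity measure, and the 0.5 threshold are all illustrative assumptions; the deck does not say how the mapping is computed.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a column name and drop punctuation and whitespace."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_similarity(a, b):
    """String similarity in [0, 1] between two normalized column names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def suggest_mapping(source_columns, unified_attributes, threshold=0.5):
    """Map each source column to its most similar unified attribute.

    Columns whose best score is below the (assumed) threshold map to None,
    meaning a person would have to decide.
    """
    suggestions = {}
    for col in source_columns:
        scored = [(name_similarity(col, attr), attr) for attr in unified_attributes]
        score, best = max(scored)
        suggestions[col] = best if score >= threshold else None
    return suggestions


# Hypothetical column names from different supplier databases.
unified = ["Part Number", "Supplier"]
columns = ["part_no", "PartNum", "supplier_name", "color"]
mapping = suggest_mapping(columns, unified)
```

Name similarity alone is a weak signal; a practical matcher would also compare the values stored in each column.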


Record Linkage and Deduplication


Example: Thomson Reuters spent 6 months on a single deduplication project of a subset of their data sources

Record 1

Record 2

Record 3

Record 4

Unified Record


Missing Values


Example: Most real data collected from sensors, surveys, and agents has a high percentage of N/A or null values, special values (99999), etc.

ID  Name   ZIP    City           State  Income
1   Green  60610  Chicago        IL     30k
3   Peter         New Yrk        NY     40k
4   John   11507  New York       NY     40k
5   Gree   90057  Los Angeles    CA     55k
6   Chuck  90057  San Francisco  CA     30k

Common Data Quality Issues


Duplicates

Syntactic Error

Integrity Constraint Violation

Missing Value
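The four issue types can be illustrated with simple checks over the sample table from the Missing Values slide. This is a sketch under stated assumptions: which cells the slide actually flags is not recoverable from the text, the list of known cities is a hypothetical reference list, the constraint "ZIP determines City" is assumed, and name similarity is only a crude signal for candidate duplicates.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Rows transcribed from the sample table; None marks the empty ZIP cell.
rows = [
    {"ID": 1, "Name": "Green", "ZIP": "60610", "City": "Chicago", "State": "IL"},
    {"ID": 3, "Name": "Peter", "ZIP": None, "City": "New Yrk", "State": "NY"},
    {"ID": 4, "Name": "John", "ZIP": "11507", "City": "New York", "State": "NY"},
    {"ID": 5, "Name": "Gree", "ZIP": "90057", "City": "Los Angeles", "State": "CA"},
    {"ID": 6, "Name": "Chuck", "ZIP": "90057", "City": "San Francisco", "State": "CA"},
]

# Assumed reference vocabulary; a real system would use a gazetteer.
KNOWN_CITIES = {"Chicago", "New York", "Los Angeles", "San Francisco"}


def missing_values(rows):
    """Return (ID, column) for every empty cell."""
    return [(r["ID"], col) for r in rows for col, v in r.items() if v is None]


def syntactic_errors(rows):
    """Return (ID, city) where the city is not in the reference vocabulary."""
    return [(r["ID"], r["City"]) for r in rows if r["City"] not in KNOWN_CITIES]


def zip_city_violations(rows):
    """Return ZIPs that map to more than one city (assumed rule: ZIP -> City)."""
    cities = defaultdict(set)
    for r in rows:
        if r["ZIP"] is not None:
            cities[r["ZIP"]].add(r["City"])
    return {z: sorted(c) for z, c in cities.items() if len(c) > 1}


def candidate_duplicates(rows, threshold=0.85):
    """Return ID pairs whose names are nearly identical."""
    return [
        (a["ID"], b["ID"])
        for a, b in combinations(rows, 2)
        if SequenceMatcher(None, a["Name"], b["Name"]).ratio() >= threshold
    ]
```

On the sample rows these checks report the empty ZIP in row 3, the misspelled "New Yrk", ZIP 90057 mapping to two cities, and the Green/Gree pair as a candidate duplicate.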



Are We Missing the Real Challenges?

"Poor data quality is the norm rather than the exception, but most organizations are in a state of denial about this issue." - Gartner Group


Realities of Data Curation Efforts

Data is owned by people and is not an orphan


Result: Fully automated cleaning will probably never be adopted in an enterprise setting

[Figure: Data Stewards, IT, and Data Experts in constant interaction]


Scale renders most solutions un-deployable


Result: Need to rethink all cleaning algorithms, including record linkage, to work at scale and avoid quadratic complexity. De-duping one million records naïvely can take weeks (even on a big machine)
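To make the quadratic cost concrete: comparing every record with every other record takes n(n-1)/2 comparisons, which is about 5 × 10^11 pairs for one million records. The sketch below shows plain key-based blocking, a standard way to cut the number of candidate pairs. It is not the fuzzy blocking described later in the deck; the blocking key (first three ZIP digits) is an assumption, and the six toy rows are those of the entity resolution example.

```python
from collections import defaultdict
from itertools import combinations


def naive_pair_count(n):
    """Number of comparisons when every record is compared with every other."""
    return n * (n - 1) // 2


def candidate_pairs(records, key):
    """Yield only pairs of record IDs that share the same blocking key."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key(rec)].append(rid)
    for ids in blocks.values():
        yield from combinations(ids, 2)


# Toy records: ID -> (name, ZIP).
records = {
    "P1": ("Green", "51519"),
    "P2": ("Green", "51518"),
    "P3": ("Peter", "30528"),
    "P4": ("Peter", "30528"),
    "P5": ("Gree", "51519"),
    "P6": ("Chuck", "51519"),
}

# Assumed blocking key: the first three digits of the ZIP code.
blocked = list(candidate_pairs(records, key=lambda rec: rec[1][:3]))

print(naive_pair_count(1_000_000))     # pairs for one million records
print(naive_pair_count(len(records)))  # pairs for the toy data without blocking
print(len(blocked))                    # pairs for the toy data with blocking
```

Blocking trades recall for speed: two true duplicates that land in different blocks are never compared, so the choice of key matters.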


Data Variety is even worse


Result: Curation requires its own stack including transformations and adaptors


Iterative by nature -- not by design


Result: Needs to be incremental and agile, with low startup overhead. A curation solution should be part of the data production line

Data stream
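One way to picture incremental curation is a deduplicator that handles records one at a time as they arrive, rather than re-cleaning the whole dataset. The class below is a hypothetical sketch: its name, the use of name similarity against the first member of each cluster, and the 0.85 threshold are assumptions, not a description of any particular product.

```python
from difflib import SequenceMatcher


class IncrementalDeduper:
    """Assigns each arriving name to an existing cluster or starts a new one."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.clusters = []  # each cluster is a list of names; first is the representative

    def add(self, name):
        """Place one record and return the index of its cluster."""
        for i, members in enumerate(self.clusters):
            representative = members[0]
            score = SequenceMatcher(None, name.lower(), representative.lower()).ratio()
            if score >= self.threshold:
                members.append(name)
                return i
        self.clusters.append([name])
        return len(self.clusters) - 1


# Records arrive one by one, as on a data stream.
deduper = IncrementalDeduper()
assignments = [deduper.add(n) for n in ["Green", "Peter", "Gree", "Chuck", "Peter"]]
```

Each new record is compared only against one representative per cluster, so the cost of an arrival grows with the number of clusters rather than with the number of records seen so far.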

Pragmatic ≠ Unprincipled

Leverage All Data

• Data ownership

• Scale

• Variety

• Incremental cleaning

[CIDR 2013]

[Figure: architecture diagram with the following labeled components: Structured and Semi-structured Data Sources; Collaborative Curation; Advanced Algorithms & Machine Learning; Expert Input; Expert Directory; Data Experts (Source owners); Data Stewards and Curators; Integrated Data & Metadata; Data Inventory; APIs; Systems; Tools; Data Scientists]

Identify sources, understand relationships and curate the massive variety of siloed data

Tamr: Machine Learning with Human Insight


Data Ownership

Non-programmatic Interfaces


Scale & Variety


Incremental


Example Tamr Functionality: Entity Resolution

Relation with duplicates

ID  name   ZIP    Income
P1  Green  51519  30k
P2  Green  51518  32k
P3  Peter  30528  40k
P4  Peter  30528  40k
P5  Gree   51519  55k
P6  Chuck  51519  30k

1. Compute Pair-wise Similarity

[Figure: similarity graph over records P1 to P6; the edge weights shown are 0.3, 0.5, 0.9, and 1.0]

2. Cluster Similar Records

[Figure: records P1 to P6 grouped into clusters]

3. Merge Clusters

[Figure: clusters merged into C1, C2, and C3]

Clean Relation

ID  name   ZIP    Income
C1  Green  51519  39k
C2  Peter  30528  40k
C3  Chuck  51519  30k
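The three steps can be sketched end to end on the six sample records. The similarity function (an equal mix of name similarity and digit-wise ZIP agreement), the 0.8 threshold, connected-components clustering, and the merge rule (most common name and ZIP, mean income) are all assumptions chosen for illustration. The deck does not specify them, and the scores computed here are not the edge weights from the slide.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

# Relation with duplicates: ID -> (name, ZIP, income in thousands).
records = {
    "P1": ("Green", "51519", 30),
    "P2": ("Green", "51518", 32),
    "P3": ("Peter", "30528", 40),
    "P4": ("Peter", "30528", 40),
    "P5": ("Gree", "51519", 55),
    "P6": ("Chuck", "51519", 30),
}


def similarity(a, b):
    """Step 1: assumed score, half name similarity and half ZIP digit agreement."""
    name_sim = SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio()
    zip_sim = sum(x == y for x, y in zip(a[1], b[1])) / len(a[1])
    return 0.5 * name_sim + 0.5 * zip_sim


def cluster(records, threshold=0.8):
    """Step 2: connected components over pairs scoring at or above the threshold."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if similarity(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), []).append(rid)
    return sorted(groups.values())


def merge(records, group):
    """Step 3: assumed merge rule, most common name and ZIP plus mean income."""
    names = Counter(records[r][0] for r in group)
    zips = Counter(records[r][1] for r in group)
    income = round(sum(records[r][2] for r in group) / len(group))
    return (names.most_common(1)[0][0], zips.most_common(1)[0][0], income)


clusters = cluster(records)
clean = {f"C{i}": merge(records, g) for i, g in enumerate(clusters, start=1)}
```

Under these assumptions the clusters are {P1, P2, P5}, {P3, P4}, and {P6}, and the merged rows match the clean relation, including C1's income of 39k as the mean of 30k, 32k, and 55k.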

Tamr Solution

• Object linkage model

– Treats schema mapping and record linkage as one process

– Feature extraction and evidence accumulation

• Novel fuzzy blocking

– A hierarchy of classifiers, from binning candidates in the same comparison clusters to aggressive linkage of duplicates

• Open channel with humans in different capacities

– Experts to increase the training sets

– Stewards for verification and application of updates

– IT for modeling user enterprise rules
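One common way to keep humans in the loop is to decide confident cases automatically and send only uncertain pairs to experts, whose answers can then be added to the training set. The sketch below assumes this triage pattern; the function name, the 0.4 and 0.8 cut-offs, and the pair labels are hypothetical, and the deck does not describe the routing logic.

```python
def triage(scored_pairs, low=0.4, high=0.8):
    """Split scored candidate pairs into auto-match, ask-expert, and auto-non-match."""
    auto_match, ask_expert, auto_non_match = [], [], []
    for pair, score in scored_pairs:
        if score >= high:
            auto_match.append(pair)
        elif score <= low:
            auto_non_match.append(pair)
        else:
            ask_expert.append(pair)  # uncertain: route to a human expert
    return auto_match, ask_expert, auto_non_match


# Hypothetical candidate pairs with illustrative similarity scores.
scored_pairs = [("pair_a", 0.3), ("pair_b", 0.5), ("pair_c", 0.9), ("pair_d", 1.0)]
auto_match, ask_expert, auto_non_match = triage(scored_pairs)
```

Narrowing the band between the two cut-offs reduces expert workload at the cost of more automatic mistakes.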


Tamr Architecture


Suggestions to Build a Unified Schema


ML to Match Records Across Sources


www.tamr.com

Thank you
