Dirty Data Data Cleansing Xxxxxx DSCI 5240 December 4, 2012

Page 1

Dirty Data
Data Cleansing

Xxxxxx
DSCI 5240

December 4, 2012

Page 2

Introduction

• Real data is dirty

• Why clean?
– Eliminate duplicates
– Smaller database
– Accurate statistics

• The problem
– Merge/Purge of large databases

Page 3

Preview

• Data Cleansing Solutions

• Real World Data

• OCAR’s Data

• Conclusion

Page 4

Data Cleansing Solutions

• Sorted-Neighborhood Method

• Equational Theory

• Transitive Closure

Page 5

Sorted-Neighborhood Method

• Three phases
– 1. Create keys
– 2. Sort the data
– 3. Merge

• Three passes, each using a different key
– Multi-pass method

Page 6

Sorted-Neighborhood Method

• Key selection

First Name   Last Name   Address             ID         Key
Sal          Stolfo      123 First Street    45678987   STLSAL123FRST456
Sal          Stolfo      123 First Street    45678987   STLSAL123FRST456
Sal          Stolpho     123 First Street    45678987   STLSAL123FRST456
Sal          Stiles      123 Forest Street   45654321   STLSAL123FRST456
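The slides do not state how the key is built. The sketch below is one rule that happens to reproduce the key in the table: first three consonants of the last name, first three letters of the first name, the street number, first four consonants of the street name, and the first three digits of the ID. The rule and the names `consonants` and `make_key` are assumptions made for illustration.

```python
def consonants(text, n):
    """Return the first n consonants of text, upper-cased (Y counts as a consonant)."""
    letters = [c for c in text.upper() if c.isalpha() and c not in "AEIOU"]
    return "".join(letters[:n])


def make_key(first, last, address, id_number):
    """Build a sort key from parts of each field (assumed rule, not from the slides)."""
    # Assumes the address starts with a street number followed by the street name.
    number, street = address.split()[:2]
    return (
        consonants(last, 3)        # e.g. Stolfo -> STL
        + first.upper()[:3]        # e.g. Sal -> SAL
        + number                   # e.g. 123
        + consonants(street, 4)    # e.g. First -> FRST
        + str(id_number)[:3]       # e.g. 45678987 -> 456
    )


rows = [
    ("Sal", "Stolfo", "123 First Street", "45678987"),
    ("Sal", "Stolpho", "123 First Street", "45678987"),
    ("Sal", "Stiles", "123 Forest Street", "45654321"),
]
keys = [make_key(*row) for row in rows]
```

Under this rule all three distinct rows map to `STLSAL123FRST456`, so a misspelled name and a genuinely different person end up next to each other after sorting; the merge phase has to tell them apart.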

Page 7

Sorted-Neighborhood Method

• Sort using the key selected

First Name   Last Name   Address             ID         Key
Sal          Stolfo      123 First Street    45678987   STLSAL123FRST456
Sal          Stolfo      123 First Street    45678987   STLSAL123FRST456
Sal          Stolpho     123 First Street    45678987   STLSAL123FRST456
Sal          Stiles      123 Forest Street   45654321   STLSAL123FRST456

Page 8

Sorted-Neighborhood Method

• A fixed-size ‘window’ slides over the sorted records; only records inside the same window are compared for merging

First Name   Last Name   Address             ID         Key
Sal          Stolfo      123 First Street    45678987   STLSAL123FRST456
Sal          Stolfo      123 First Street    45678987   STLSAL123FRST456
Sal          Stolpho     123 First Street    45678987   STLSAL123FRST456
Sal          Stiles      123 Forest Street   45654321   STLSAL123FRST456
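The sort and window steps can be sketched as follows. This is a minimal illustration, not the presenter's implementation: `sorted_neighborhood`, `key_func`, `is_match`, and the default window size are hypothetical names and values.

```python
def sorted_neighborhood(records, key_func, is_match, window=3):
    """Return index pairs of records judged to match within a sliding window.

    records  : list of records (any type)
    key_func : builds the sort key for one record
    is_match : returns True when two records should be merged
    window   : number of consecutive sorted records compared with each other
    """
    # Sort record indexes by key so the result refers to original positions.
    order = sorted(range(len(records)), key=lambda i: key_func(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        # Compare the current record only with the previous window-1 records.
        for j in order[max(0, pos - window + 1):pos]:
            if is_match(records[j], records[i]):
                pairs.append((min(i, j), max(i, j)))
    return pairs


# Toy usage: strings are their own key; two strings match on a shared 2-letter prefix.
sample = ["aaa", "zzz", "aab", "zzy"]
found = sorted_neighborhood(sample, lambda r: r, lambda a, b: a[:2] == b[:2], window=2)
```

With a window of w over N sorted records, the scan makes at most N × (w − 1) comparisons instead of comparing every pair.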

Page 9

Merge Phase - Equational Theory

• A set of equation rules that defines equivalence

• A type of clustering function (pattern recognition)

• Rules may require an expert

Page 10

Merge Phase - Equational Theory

English rules:

Given two records, r1 and r2.

IF (the last name of r1 equals the last name of r2,

AND the first names differ slightly,

AND the address of r1 equals the address of r2)

THEN r1 is equivalent to r2
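The rule above could be coded roughly as below. The slide does not define "differ slightly"; here it is approximated with the similarity ratio from Python's standard `difflib.SequenceMatcher` and an arbitrary 0.8 threshold, both of which are assumptions. Identical first names also pass this check. The field names `first`, `last`, and `address` are hypothetical.

```python
from difflib import SequenceMatcher


def differ_slightly(a, b, threshold=0.8):
    """Assumed stand-in for 'differ slightly': similarity ratio at or above threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def equivalent(r1, r2):
    """One equational-theory rule: same last name, similar first name, same address."""
    return (
        r1["last"] == r2["last"]
        and differ_slightly(r1["first"], r2["first"])
        and r1["address"] == r2["address"]
    )


r1 = {"first": "Ramon", "last": "Bonilla", "address": "38 Ward St."}
r2 = {"first": "Raymond", "last": "Bonilla", "address": "38 Ward St."}
same_person = equivalent(r1, r2)
```

A real rule base would hold many such rules (the OCAR study later in the deck used 24) and would usually need a domain expert to set the thresholds.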

Page 11

Merge Phase - Equational Theory

Results

      SSN         Name (First, Initial, Last)   Address
r1    334600443   Lisa Boardman                 144 Wars St.
r2    334600443   Lisa Brown                    144 Ward St.

r1    525520001   Ramon Bonilla                 38 Ward St.
r2    525250001   Raymond Bonilla               38 Ward St.

r1    0           Diana D. Ambrosion            40 Brik Church Av.
r2    0           Diana A. Dambrosion           40 Brick Church Av.

r1    0           Colette Johnen                600 113th St. apt.5a5
r2    0           John Colette                  600 113th St. ap. 585

r1    850982319   Ivette A Keegan               23 Florida Av.
r2    950982319   Yvette A Kegan                23 Florida St.

Page 12

Merge Phase - Transitive Closure

• Applied to a single pass sorted-neighborhood method

• Improvement of accuracy

• Decreases processing time and cost

Page 13

Merge Phase - Transitive Closure

English rules:

Given three records a, b and c.

IF (a is similar to b

AND b is similar to c)

THEN a is similar to c
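One common way to compute this closure over matched pairs is a union-find (disjoint-set) structure. The sketch below assumes records are identified by integer indexes 0..n-1 and that `pairs` holds the matches found in the merge phase; the function name is hypothetical.

```python
def transitive_closure(n, pairs):
    """Group n records into clusters, given pairs (a, b) judged similar."""
    parent = list(range(n))  # each record starts in its own cluster

    def find(x):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Merge the two clusters; keep the smaller index as the root.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# a=0 is similar to b=1, and b=1 is similar to c=2, so all three are grouped.
clusters = transitive_closure(5, [(0, 1), (1, 2)])
```

Records 0 and 2 were never compared directly, yet they land in the same cluster, which is how the closure recovers matches that a small window misses.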

Page 14

Real World Data

• State of Washington Department of Social and Health Services

• Office of Children Administrative Research (OCAR) of the Department of Social and Health Services

Page 15

OCAR’s Data

• 6,000,000 records
• Grows by 50,000 per month
• 19 fields
– First and last name
– Birthdate
– SSN
– Case number
– Worker ID
– Gender
– Race
– Service ID
– Service dates
– Payments

Page 16

OCAR’s Data - Problems

• Names misspelled
• Missing birthdates
• Missing or wrong SSN
• Multiple case numbers
• Ghost records

Page 17

OCAR’s Data - Goals

• To answer:
– “How many children are in foster care?”
– “How long do children stay in foster care?”
– “How many different homes do children typically stay in?”

Page 18

OCAR’s Data - Cleaning

• 128,438 records sampled (one service office)
• Consulted with expert [1]
• 24 rules established
• Used sorted-neighborhood multi-pass methods
• Applied equational theory
• Keys
– 1. Last name, First name, SSN, and Case number
– 2. First name, Last name, SSN, and Case number
– 3. Case number, First name, Last name, and SSN

[1] Timothy Clark, OCAR Computer Information Consultant
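A multi-pass run with the three key orders listed above might look like the sketch below. It is illustrative only: the field names (`last`, `first`, `ssn`, `case`), the function names, and the matching rule in the usage example are assumptions, and the actual study used its own 24-rule equational theory.

```python
# The three key orders from the slide, as tuples of (hypothetical) field names.
KEY_ORDERS = [
    ("last", "first", "ssn", "case"),
    ("first", "last", "ssn", "case"),
    ("case", "first", "last", "ssn"),
]


def window_pairs(records, key_fields, is_match, window=3):
    """One sorted-neighborhood pass: sort on key_fields, compare within the window."""
    order = sorted(
        range(len(records)),
        key=lambda i: tuple(records[i][f] for f in key_fields),
    )
    found = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if is_match(records[i], records[j]):
                found.add((min(i, j), max(i, j)))
    return found


def multi_pass(records, is_match, window=3):
    """Union the matches from one pass per key order."""
    pairs = set()
    for key_fields in KEY_ORDERS:
        pairs |= window_pairs(records, key_fields, is_match, window)
    return pairs


people = [
    {"first": "Ann", "last": "Zed", "ssn": "1", "case": "C1"},
    {"first": "Bob", "last": "May", "ssn": "2", "case": "C5"},
    {"first": "Ann", "last": "Aed", "ssn": "1", "case": "C9"},  # misspelled last name
]
# Toy matching rule (assumption): same first name and same SSN.
toy_match = lambda a, b: a["first"] == b["first"] and a["ssn"] == b["ssn"]
first_pass = window_pairs(people, KEY_ORDERS[0], toy_match, window=2)
all_passes = multi_pass(people, toy_match, window=2)
```

In this toy data the misspelled last name keeps the two "Ann" records apart when sorting by last name, but the pass keyed on first name places them side by side, so only the multi-pass result contains the pair.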

Page 19

OCAR’s Data - Results

• Identified 8,504 individuals in sample

• 45.8% correctly classified

• 86.0% were correctly merged

• Multi-pass sorted-neighborhood confirmed

Page 20

Review

• Multi-pass sorted-neighborhood method

• Equational method

• OCAR’s data

Page 21

Conclusions

• Sorted-neighborhood method can be expensive
– During the sorting phase
– Process time

• Improved accuracy
– Run multiple times (multi-pass)
– Small windows
– Computation of the transitive closure

Page 22

Sources

• Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem; Mauricio A. Hernandez and Salvatore J. Stolfo; Department of Computer Science, Columbia University, New York, NY 10027.

• Haiguang Li, 2011 class presentation

• www.cs.columbia.edu/~sal

• http://www.dshs.wa.gov/default.shtm