how to recognize and fix data faults that can make or...

Post on 25-Sep-2020

2 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Cary S. Shaw & Associates, LLC

How to Recognize

and Fix Data Faults

That Can Make or Break Analytics

Data

Interpretation

Sampling

By Cary S. Shaw & Associates, LLC

For further information contact (203)505-3180

caryshaw@optonline.net

© Copyright Cary S. Shaw 2013

For dissemination, permission required from copyright holder.

Cary S. Shaw & Associates, LLC

Get the Data Quality Right

Business depends on data.

Whether you are:

Requestor

Developer

Producer

Interpreter

Analytics Consultant

Analytics Recipient

Cary S. Shaw & Associates, LLC

We will cover:

Qualify

Sniff, Define, Frequency Check

Interpret

Lurk, Miss, Population

Sample

Methods, Application

Summary

Tools, Lessons

Cary S. Shaw & Associates, LLC

How many products are in company’s

data base?

Marketing client sent request to IT

Detail result provided in Excel.

Marketing client relayed file: Answer: 1,000.

Cary S. Shaw & Associates, LLC

How many products are in company’s

data base?

Marketing client sent request to IT

Detail result provided in Excel.

Marketing client relayed file: Answer: 1,000.

What happened: SQL Navigator timed out.

Real answer: 3,646.

Cary S. Shaw & Associates, LLC

Spreadsheet data

received in this form.

Business Name Street City State Zip Phone

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx xxxxx (xxx)xxx-xxxx

xxxxxxxxxx xxxxxxxxxx xxxxx xx

Cary S. Shaw & Associates, LLC

Assume data is in error.

Search for errant values.

Frequency tables

Outliers - Confidence intervals

Outliers - One variable graphed/estimated against

another

Referential Integrity (cross table connections within an

integrated database)

Sniff test

Don’t dismiss data if it is different; only if it is wrong.

Cary S. Shaw & Associates, LLC

How unique are our customer phone

numbers?

Are we really reaching the person / business we intend?

Are we alienating by multiple calls?

Are we missing multiple clients?

Practical question: How frequently does the same phone

number appear on our customer contact file?

Cary S. Shaw & Associates, LLC

Exploring Dividend Yields

Found company with 7.2 % dividend yield

What’s going on?

Cary S. Shaw & Associates, LLC

Could a ‘9’ or ‘0’

mean something else?

Cary S. Shaw & Associates, LLC

How do we want to select business

vs. home prospects?

H = Home

B = Business

And how do we want to convert this to a numeric

factor?

Cary S. Shaw & Associates, LLC

Select B (business) or Reject H

(home)? Different impact.

Other values have impact

Is field up-to-date?

Is client (blank) different

from proprietary (blank)?

Is (blank) or misspelling a

proxy for other

characteristics?

How about Duns Industry

sector?

Business, Home Frequency

(blank) 98,947

B 5,032

b 159

Business 49

Busn 1

H 24,891

h 475

Home 63

Hume 1

Hme 11

Cary S. Shaw & Associates, LLC

Riddles

Product Usage a predictor of retention

Optimum student loan amount

Effect of sales channel on retention

Longevity dependency on location

Cary S. Shaw & Associates, LLC

Fitting a line: effect of product usage on retention

Large database

Cary S. Shaw & Associates, LLC

Effect of Product Usage (Line fitted to

data points) Effect of Usage on Cancel Rate??

0 100 200 300 400 500

Meter Usage

Can

cel R

ate

Cancel Rate

Linear (Cancel Rate)

Cancel Rate = f(Usage) = constant

Cary S. Shaw & Associates, LLC

Is ‘zero’ really ‘missing’? Impact

on bining and fitting. Effect of Usage on Cancel Rate??

0 100 200 300 400 500

Meter Usage

Can

cel R

ate

Cancel Rate

Linear (Cancel Rate)

Cary S. Shaw & Associates, LLC

Student loans and loan delinquency

Cary S. Shaw & Associates, LLC

Do small loans result in high

delinquencies? Dismissal factor? Effect of Loan Amount on Delinquency

0.0000%

1.0000%

2.0000%

3.0000%

4.0000%

5.0000%

6.0000%

7.0000%

2000 4000 6000 8000 10000 12000 14000 16000 18000 20000 22000 24000 26000

Loan Amount

De

lin

qu

en

cy

Rate

.

Loans in Collection

Cary S. Shaw & Associates, LLC

1 out of 10 in the original

population

Cary S. Shaw & Associates, LLC

Looks like 1 out of 4 in the

remaining population

Cary S. Shaw & Associates, LLC

Watch the whole population of interest

(affects the denominator) Effect of Loan Amount on Delinquency

0.0000%

1.0000%

2.0000%

3.0000%

4.0000%

5.0000%

6.0000%

7.0000%

2000 4000 6000 8000 10000 12000 14000 16000 18000 20000 22000 24000 26000

Loan Amount

De

lin

qu

en

cy

Rate

.

Loans in Collection

Original Loans in Collection

Cary S. Shaw & Associates, LLC

Where you live can affect your life span.

Connecticut ?

Cary S. Shaw & Associates, LLC

Where is Mortality Lowest? Mortality (Prob of Dying Next Yr) As % of U.S. Rate

Alaska............, 57.0

Connecticut......., 102.5

50.0

60.0

70.0

80.0

90.0

100.0

110.0

Reported Mortality Index

Ind

ex (

US

= 1

00)

Alaska............

Connecticut.......

Cary S. Shaw & Associates, LLC

Is Connecticut Mortality Above Avg? Connecticut Mortality (Prob of Dying Next Yr) As % of U.S. Rate

Reported Total

Under 1

1-4

5-14 15-24

25-34

35-44

45-54

55-64

65-74

75-84

85 years

50.0

60.0

70.0

80.0

90.0

100.0

110.0

Age Group

Ind

ex (

US

=100)

Connecticut.......

Cary S. Shaw & Associates, LLC

Compare with Alaska by Age Mortality (Prob of Dying Next Yr) As % of U.S. Rate

Under 1

1-4

5-14

15-24

25-34

35-44

45-54

55-64

65-7475-84

85 years

Under 1

1-4

5-14 15-24

25-34 35-4445-54 55-64

65-7475-84

85 years

0.0

20.0

40.0

60.0

80.0

100.0

120.0

140.0

160.0

180.0

200.0

Age Group

Ind

ex (

US

=1

00

)

Alaska............

Connecticut.......

Cary S. Shaw & Associates, LLC

Compare Observed with Age-Adjusted Mortality Rates by State as Percent of U.S.

40.0

60.0

80.0

100.0

120.0

140.0

160.0

180.0

50.0 60.0 70.0 80.0 90.0 100.0 110.0 120.0 130.0 140.0 150.0

Reported Mortality Index

Ag

e A

dju

ste

d M

ort

ality

In

dex

Series1

Linear (Series1)

Alaska

CT

Cary S. Shaw & Associates, LLC

Compare Rankings State Ranks Compared

0

13

26

39

52

0 13 26 39 52

Reported State Rank

Ad

jus

ted

Sta

te R

an

k

RankAdjusted

Linear (RankAdjusted)

Alaska

CT

Cary S. Shaw & Associates, LLC

For full table identifying mortality rate in each state,

contact Cary S. Shaw & Associates, LLC

Cary S. Shaw & Associates, LLC

Riddles

Product Usage a predictor of retention

Optimum student loan amount

Effect of sales channel on retention

Longevity dependency on location

Cary S. Shaw & Associates, LLC

Lessons

Separate cause and effect by chronology.

Find the initial populations under study.

Distinguish between missing and zero values.

Search for significant lurking variables.

Creatively test your own ideas.

Cary S. Shaw & Associates, LLC

Why Sample?

Examine large databases

Rich on detail, economical on resources

Check own analytics

Cross check IT

Cross check vendors

May enable GREATER accuracy because

Enables correction of representative points

Enables understanding of representative points

Cary S. Shaw & Associates, LLC

Which sample is most

representative?

Beginning Middle

Cary S. Shaw & Associates, LLC

Which sample is most

representative?

Beginning Middle Random

Cary S. Shaw & Associates, LLC

Which sample is most

representative?

Beginning Middle Random Stratified Random

Cary S. Shaw & Associates, LLC

Stratified Random is making sure

the sample is representative.

Method A:

Divide data into strata (categories, deciles, etc.)

Pick target number from each strata

Method B:

Sort file by factors known to be important (such as

income, age, even ID number)

Pick every n th record, starting with record n/2

Cary S. Shaw & Associates, LLC

What happens when you match

samples?

How big is the resulting sample?

Year 1 Year 2 Year 3

10% 10% 10% = 0.1%

Customer Product Sales$ Supplies A/R

10% 10% . 10% 10% 10%

= 0.001%

Cary S. Shaw & Associates, LLC

There’s a way to take samples that

match even as the data is changing

(don’t lose by inner joins)

Cary S. Shaw & Associates, LLC

Examples of Record IDs

Social Security Number

Credit Card Number

Account ID

Develop algorithm to transform Record ID into

Random Number.

Then select from each file based on Random Number

range.

Samples from different files will then match.

Cary S. Shaw & Associates, LLC

Example: Duns File Update

How did NAICS code change?

Jan NAICS = April NAICS

Jan NAICS <> April NAICS

No January Record

No April Record

Cary S. Shaw & Associates, LLC

Example: Duns File Update

How did NAICS code change?

Jan NAICS = April NAICS 1006

Jan NAICS <> April NAICS 0

No January Record 0

No April Record 0

Cary S. Shaw & Associates, LLC

Example: Duns File Update

How did NAICS code change?

Jan NAICS = April NAICS 829

Jan NAICS <> April NAICS 107

No January Record 64

No April Record 70

Cary S. Shaw & Associates, LLC

We covered:

Qualify

Sniff, Define, Frequency Check

Interpret Lurk, Miss, Population

Sample

Methods, Application

Summary Tools, Lessons

Cary S. Shaw & Associates, LLC

Lessons

Perform the sniff test.

Check number of records.

Do frequency counts (highest, lowest).

Look for missing values (distinguish from zeros and

duplicates).

Investigate lurking interactive variables.

Analyze what happened to original populations, not

final populations.

Examine representative samples (stratified preferred).

Assume errors and look for them. (Apply G.I.V.O.).

Cary S. Shaw & Associates, LLC

- FIN -

Avoid the pitfalls of found

data and reach for the Stars

For further information contact:

Cary S. Shaw & Associates, LLC

(203)505-3180

caryshaw@optonline.net

top related