my law

Exercise Data Preparation

Upload: sankett

Post on 09-Jun-2015

TRANSCRIPT

Page 1: My Law

Exercise

Data Preparation

Page 2: My Law

Modeling Example

Business: National veterans’ organization

Objective: From a population of lapsing donors, identify individuals worth continued solicitation.

Source: 1998 KDD-Cup Competition via UCI KDD Archive

Page 3: My Law

The Story

A national veterans’ organization seeks to better target its solicitations for donation. By only soliciting the most likely donors, less money will be spent on solicitation efforts and more money will be available for charitable concerns.

Solicitations involve sending a small gift to an individual together with a request for donation. Gifts include mailing labels and greeting cards.

Of particular interest is the class of individuals identified as lapsing donors. These individuals made their most recent donation between 12 and 24 months ago. The organization found that by predicting the response behavior of this group, they can use the model to rank all 3.5 million individuals in their database.

The current campaign refers to a greeting card mailing sent in 06/1997. The source of this data is the Association for Computing Machinery’s (ACM) 1998 KDD-Cup competition.

Page 4: My Law

Additional Data Preparation

Raw Analysis Data: 95,412 records, 481 fields

Final Analysis Data: 19,372 records, 50 fields

The raw analysis data has been reduced for the purpose of this course. A subset of slightly over 19,000 records has been selected for modeling. As will be seen, this subset was not chosen arbitrarily. In addition, the 481 fields have been reduced to 50.

Page 5: My Law

Analysis Data Definition

Donor master data

CONTROL_NUMBER: Unique donor ID
MONTHS_SINCE_ORIGIN: Elapsed time since first donation
IN_HOUSE: 1=Given to In House program, 0=Not In House donor

Page 6: My Law

Analysis Data Definition

Demographic and other overlay data

OVERLAY_SOURCE: M=Metromail, P=Polk, B=both
DONOR_AGE: Age as of June 1997
DONOR_GENDER: Actual or inferred gender
PUBLISHED_PHONE: Published telephone listing
HOME_OWNER: H=homeowner, U=unknown
MOR_HIT: Mail order response hit rate

Page 7: My Law

Analysis Data Definition

Demographic and other overlay data

PER_CAPITA_INCOME: Income per capita in dollars
MED_HOUSEHOLD_INCOME: Median income in $100s
CLUSTER_CODE: 54 socio-economic cluster codes
SES: 5 socio-economic cluster codes
WEALTH_RATING: 10 wealth rating groups
INCOME_GROUP: 7 income group levels

SES is a roll-up of the socio-economic field CLUSTER_CODE.

Page 8: My Law

Analysis Data Definition

Demographic and other overlay data

PCT_OWNER_OCCUPIED: Percent owner occupied housing
MED_HOME_VALUE: Median home value in $100s
URBANICITY: U=urban, C=city, S=suburban, T=town, R=rural, ?=unknown

Page 9: My Law

Analysis Data Definition

Census overlay data

PCT_MALE_MILITARY: Percent male military in block
PCT_MALE_VETERANS: Percent male veterans in block
PCT_VIETNAM_VETERANS: Percent Vietnam veterans in block
PCT_WWII_VETERANS: Percent WWII veterans in block

Page 10: My Law

Analysis Data Definition

Transaction detail data

[Figure: timeline, '94 to '98, with 97NK marked]

NUMBER_PROM_12: Number promotions last 12 mos.
CARD_PROM_12: Number card promotions last 12 mos.

Page 11: My Law

Analysis Data Definition

Transaction detail data

[Figure: timeline, '94 to '98, with 96NK and 97NK marked]

FREQ_STATUS_97NK: Frequency status, June '97
RECENCY_STATUS_96NK: Recency status, June '96
LAST_GIFT_AMT: Amount of most recent donation
MONTHS_SINCE_LAST: Months since last donation

Page 12: My Law

Analysis Data Definition

RECENT transaction detail data

[Figure: timeline, '94 to '98, with 94NK and 96NK marked]

RESPONSE_PROP: Response proportion since June '94
RESPONSE_COUNT: Response count since June '94
AVG_GIFT_AMT: Average gift amount since June '94
RECENT_STAR_STATUS: STAR (1, 0) status since June '94

The sampling method implies that no one made a donation between 6/1996 and 6/1997. However, for a limited number of cases, the number of months since last gift is fewer than 12. This contradiction is not resolved in the data’s documentation, nor will it be resolved here.

Page 13: My Law

Analysis Data Definition

RECENT transaction detail data

[Figure: timeline, '94 to '98, with 94NK and 96NK marked]

CARD_RESPONSE_PROP: Response proportion since June '94
CARD_RESPONSE_COUNT: Response count since June '94
CARD_AVG_GIFT_AMT: Average gift amount since June '94

Page 14: My Law

Analysis Data Definition

LIFETIME transaction detail data

[Figure: timeline, '94 to '98, with 94NK and 96NK marked]

PROM: Total number promotions ever
GIFT_COUNT: Total number donations ever
AVG_GIFT_AMT: Overall average gift amount
PEP_STAR: STAR status ever (1=yes, 0=no)

Page 15: My Law

Analysis Data Definition

LIFETIME transaction detail data

[Figure: timeline, '94 to '98, with 94NK and 96NK marked]

GIFT_AMOUNT: Total gift amount ever
GIFT_COUNT: Total number donations ever
MAX_GIFT: Maximum gift amount
GIFT_RANGE: Maximum less minimum gift amount

Page 16: My Law

Analysis Data Definition

KDD supplied LIFETIME transaction detail data

[Figure: timeline, '94 to '98, with 94NK and 96NK marked]

MONTHS_SINCE_LAST: Last donation date from June '97
MONTHS_SINCE_FIRST: First donation date from June '97
FILE_AVG_GIFT: Average gift from raw data
FILE_CARD_GIFT: Average card gift from raw data

Page 17: My Law

Analysis Data Definition

Transaction detail data target definition

[Figure: timeline, '94 to '98, with 97NK marked]

TARGET_B: Response to 97NK solicitation (1=yes, 0=no)
TARGET_D: Response amount to 97NK solicitation (missing if no response)

Page 18: My Law

Demonstration

Data set: PVA_RAW_DATA

Purpose:
Get familiar with the data
Basic decision modeling with tree, regression, and neural network

Parameters:
Prior probabilities: (0.05, 0.95)
Profit matrix: ($14.62, -0.68)
Target: TARGET_B (TARGET_D must be rejected)
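The demonstration parameters can be read as a simple decision rule. The sketch below is a minimal illustration, not the course's implementation. It assumes the profit matrix means $14.62 profit for soliciting a responder and a $0.68 loss for soliciting a non-responder, and that the first prior (0.05) is the population response rate. The function names and the 0.25 sample response rate in the usage note are hypothetical.

```python
def adjust_for_priors(p_sample, sample_rate, prior_rate):
    """Rescale a response probability estimated on an oversampled data set
    so that it reflects the population prior (standard prior correction)."""
    responder = p_sample * prior_rate / sample_rate
    non_responder = (1.0 - p_sample) * (1.0 - prior_rate) / (1.0 - sample_rate)
    return responder / (responder + non_responder)


def expected_profit(p_response, profit_response=14.62, profit_no_response=-0.68):
    """Expected profit of soliciting one individual (assumed reading of the profit matrix)."""
    return p_response * profit_response + (1.0 - p_response) * profit_no_response


def should_solicit(p_response):
    """Solicit only when the expected profit is positive."""
    return expected_profit(p_response) > 0.0


# Break-even probability: p * 14.62 - (1 - p) * 0.68 = 0, so p = 0.68 / 15.30
BREAK_EVEN = 0.68 / (14.62 + 0.68)
```

For example, if a model trained on a sample with an assumed 25% response rate returns 0.25 for an individual, `adjust_for_priors(0.25, 0.25, 0.05)` gives 0.05, which is just above the break-even probability of roughly 0.044, so that individual would still be solicited.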

Page 19: My Law

Improving Regression Selection

[Figure: run time in minutes (0 to 60) against number of variables (25 to 100), comparing all subsets and stepwise selection]

Page 20: My Law

Improving Input Selection

Much of the success of a predictive model depends on input selection. Most input selection processes attempt to minimize input redundancy and maximize input relevancy.

Selection is usually done with a heuristic search because the complexity of an exhaustive (all subsets) search increases exponentially in the number of inputs.
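To make the exponential growth concrete, the sketch below counts candidate models for an exhaustive search and compares that with an upper bound on the models examined by plain forward selection. The function names are illustrative, and the counts ignore any pruning that a branch-and-bound method would apply.

```python
def all_subsets_models(k):
    """Number of non-empty input subsets an exhaustive search must consider."""
    return 2 ** k - 1


def forward_selection_models(k):
    """Upper bound on models fit by forward selection:
    k candidates at step 1, k - 1 at step 2, and so on."""
    return k * (k + 1) // 2


for k in (25, 50, 75, 100):
    print(k, all_subsets_models(k), forward_selection_models(k))
```

With 25 inputs the exhaustive search already covers more than 33 million subsets, while forward selection examines at most 325 models.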

There exist branch-and-bound algorithms that approximate an exhaustive input search and run quite quickly for a reasonably small number of inputs. One algorithm, found in the SAS/STAT LOGISTIC procedure, actually runs faster than the usual forward, backward, and stepwise procedures.

While the example data set in this course has fewer than 60 inputs, many modeling data sets do not. Given the promise of an exhaustive search, it would be extremely desirable to reduce the input count without compromising the quality of the ultimate predictive model.

Page 21: My Law

Improving Input Selection

Univariate Screening

Variable Clustering

Categorical Recoding

All Subsets Selection

Page 22: My Law

Input Dimension Reduction

A three-phased approach is proposed for input dimension reduction in preparation for all subsets selection.

First, a univariate screening is performed to eliminate those inputs with little promise of target association. This must be done with care to avoid eliminating inputs whose predictive value occurs only in conjunction with other inputs.

Second, variable clustering techniques are used to group correlated interval inputs and minimize input redundancy.

Third, enhanced weight-of-evidence methods are used to effectively incorporate categorical inputs into the final model.

With the input dimension reduced, an all subsets search commences on the remaining inputs.
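The third phase relies on weight of evidence (WOE), which replaces each level of a categorical input with the log ratio of its share among responders to its share among non-responders. The source does not say what the "enhanced" method adds, so the sketch below shows only plain WOE with additive smoothing. The smoothing constant and the function name are assumptions.

```python
import math
from collections import Counter


def weight_of_evidence(levels, targets, smoothing=0.5):
    """Return {level: WOE}, where WOE = ln(P(level | target=1) / P(level | target=0)).

    The smoothing constant (an assumption, not from the course) is added to
    every count so that levels with no responders or no non-responders
    still get a finite value.
    """
    events = Counter()
    non_events = Counter()
    for level, target in zip(levels, targets):
        if target == 1:
            events[level] += 1
        else:
            non_events[level] += 1

    all_levels = set(events) | set(non_events)
    total_events = sum(events.values()) + smoothing * len(all_levels)
    total_non_events = sum(non_events.values()) + smoothing * len(all_levels)

    woe = {}
    for level in all_levels:
        p_event = (events[level] + smoothing) / total_events
        p_non_event = (non_events[level] + smoothing) / total_non_events
        woe[level] = math.log(p_event / p_non_event)
    return woe
```

Applied to a field such as URBANICITY with TARGET_B as the target, each level code would be replaced by a single number, so the categorical input enters the model as one interval input instead of several indicator variables.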

Page 23: My Law

Univariate Screening

In this technique, inputs are screened based on their individual correlation with the target and only the inputs with the highest correlations are kept.
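A minimal sketch of this kind of screening follows, assuming Pearson correlation as the measure and a fixed number of inputs to keep. The function names and the dictionary-of-lists data layout are illustrative choices, not part of the course material.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear association
    return sxy / math.sqrt(sxx * syy)


def screen_inputs(inputs, target, keep):
    """Rank inputs by absolute correlation with the target and keep the top ones.

    inputs: dict mapping input name -> list of values
    target: list of target values
    keep:   number of inputs to retain
    """
    ranked = sorted(
        inputs,
        key=lambda name: abs(pearson(inputs[name], target)),
        reverse=True,
    )
    return ranked[:keep]
```

Ranking by absolute value keeps strongly negative associations as well as strongly positive ones.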

Unfortunately, this approach does not account for partial associations among the inputs. Inputs could be erroneously omitted or erroneously included. Partial associations occur when the effect of one input changes in the presence of another input.
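A small constructed example shows how an input can be erroneously omitted. In the toy data below (not from the course data), the target is 1 exactly when two inputs agree. Each input alone has zero correlation with the target, so univariate screening would discard both, yet together they determine the target completely.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy inputs: every combination of two binary inputs coded as -1 / +1.
x1 = [-1, -1, 1, 1]
x2 = [-1, 1, -1, 1]
# Target is 1 when the inputs agree, 0 otherwise.
y = [1 if a == b else 0 for a, b in zip(x1, x2)]
# The product of the inputs captures their joint effect.
x1_times_x2 = [a * b for a, b in zip(x1, x2)]

corr_x1 = pearson(x1, y)            # no univariate association
corr_x2 = pearson(x2, y)            # no univariate association
corr_joint = pearson(x1_times_x2, y)  # perfect association once combined
```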

A compromise devised to minimize the dangers of partial associations is to use univariate screening followed by liberal forward selection—not as a way of finding useful inputs, but rather as a way to eliminate clearly useless ones.

Page 24: My Law

R-square Selection for Univariate Screening

The R-square selection approach has two phases. First, the input/target correlation is calculated for each input. Each input whose squared correlation is below the minimum R-square setting is rejected.

Second, a forward selection is performed. The forward selection procedure terminates when all remaining inputs have a correlation below the specified stop R-square. These remaining inputs are also rejected.
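The two phases can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the software's actual algorithm: phase 1 compares each input's squared correlation with the minimum R-square, and phase 2 is read as stopping when no remaining input raises the model R-square by at least the stop R-square. The function name and the threshold defaults are illustrative.

```python
def _center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def r_square_selection(inputs, target, min_r2=0.005, stop_r2=0.0005):
    """Return the names of the inputs kept, in order of selection.

    inputs: dict mapping input name -> list of values
    target: list of target values
    """
    y = _center(target)
    sst = _dot(y, y)
    if sst == 0:
        return []

    # Phase 1: reject inputs whose squared correlation with the target is too small.
    candidates = {}
    for name, values in inputs.items():
        x = _center(values)
        sxx = _dot(x, x)
        if sxx == 0:
            continue  # constant input, no association possible
        if _dot(x, y) ** 2 / (sxx * sst) >= min_r2:
            candidates[name] = x

    # Phase 2: forward selection on the survivors. Remaining inputs are kept
    # orthogonal to the selected ones, so each gain is an incremental R-square.
    selected = []
    residual = y
    while candidates:
        gains = {}
        for name, x in candidates.items():
            sxx = _dot(x, x)
            # Inputs that are numerically collinear with selected ones add nothing.
            gains[name] = _dot(x, residual) ** 2 / (sxx * sst) if sxx > 1e-12 else 0.0
        best = max(gains, key=gains.get)
        if gains[best] < stop_r2:
            break  # all remaining inputs are rejected
        selected.append(best)
        q = candidates.pop(best)
        qq = _dot(q, q)
        coef_y = _dot(q, residual) / qq
        residual = [r - coef_y * v for r, v in zip(residual, q)]
        candidates = {
            name: [xi - (_dot(q, x) / qq) * qi for xi, qi in zip(x, q)]
            for name, x in candidates.items()
        }
    return selected
```

Because the remaining inputs are re-orthogonalized after every step, a redundant input (for example, a rescaled copy of one already selected) shows no incremental gain and is left out.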