
Lecture 2 Understand the Structure of SAS Enterprise Miner

Section 2.1 Getting Started
Section 2.2 Data Mining Using SEMMA
  2.2.1 Models for Data Mining
  2.2.2 SEMMA Model in SAS Enterprise Miner
Section 2.3 Sample
  2.3.1 Input Data Source node
  2.3.2 Data Partition node
Section 2.4 Exploration with INSIGHT node
Section 2.5 Modify
  2.5.1 Modify with Transformation node
  2.5.2 Modify with Replacement node
Section 2.6 Fitting Candidate Models
Section 2.7 Assessing Candidate Models and Scoring a New Data Set
Appendix 2.1 Data Used in Lecture 2
Appendix 2.2 References for Lecture 2


Section 2.1 Getting Started

1. Type Miner in the command box to start Enterprise Miner 4.1 (or select Solutions -> Analysis -> Enterprise Miner).
2. Select File -> New -> Project to start a new project.
3. Before selecting Create, you need to:
   • Type your project title in the Name box.
   • Check the box for Client/Server project only if you want to create a client/server project.
   • Modify the location of the project, if desired, by selecting Browse.


Section 2.2 Data Mining Using SEMMA in SAS

Section 2.2.1 Frameworks for Data Mining

Various "general frameworks" have been proposed to serve as blueprints for data mining. The approach proposed by SAS Institute, called SEMMA (Sample, Explore, Modify, Model, and Assess), is focused more on the technical activities typically involved in a data mining project. Several alternative frameworks are available, e.g., CRISP-DM. All of them are concerned with the process of how to integrate data mining methodology into an organization, how to "convert data into information," how to involve important stakeholders, and how to disseminate the information in a form that stakeholders can easily convert into resources for strategic decision making.

Section 2.2.2 Data Mining Using SEMMA in SAS

The tools in Enterprise Miner are arranged according to SEMMA, the SAS process for data mining. SEMMA stands for Sample, Explore, Modify, Model, and Assess.

• Sample:

Input Data Source Node: The "Input Data Source" node reads data sources and defines their attributes. It enables the user to access SAS data sets and data marts. It automatically creates the metadata sample for each variable when you import a data set, sets the initial values for the measurement level and the model role for each variable, and displays summary statistics for interval and class variables. It also enables you to define target profiles for each target variable.

Sampling Node: The "Sampling" node enables you to take simple random, stratified random, and cluster samples of the data set. Sampling is recommended for extremely large databases because it can significantly decrease model-training time. In addition, the result found from a representative sample should be expected to be similar to the result found with the entire data set. However, we will not use this node much in classroom instruction because the classroom data sets are typically relatively small.

Data Partition Node: The "Data Partition" node enables you to partition data sets into training, validation, and test data sets. The training data set is used for preliminary model fitting. The validation data set is used to monitor and tune the model during estimation and is also used for model assessment. The test data set is an additional holdout data set that can be used for model assessment. This node uses simple random, stratified random, or user-defined partitions to create the partitioned data sets.


• Explore:

Distribution Explorer Node: The "Distribution Explorer" node is a visualization tool that enables you to quickly and easily explore large volumes of data in multidimensional histograms. You can view the distribution of up to three variables at a time with this node. It also allows you to exclude extreme values for interval, nominal, ordinal, and binary variables.

Multiplot Node: The "Multiplot" node is another visualization tool that enables you to explore larger volumes of data graphically. Unlike the two other visualization tools, Insight and Distribution Explorer, the Multiplot node automatically creates bar charts and scatter plots for the input and target variables without requiring any user input, and the SAS code used to create these plots is available for batch runs as well.

Insight Node: SAS/INSIGHT is an interactive tool for data exploration and analysis. One can analyze univariate distributions, investigate multivariate distributions, and fit explanatory models using generalized linear models.

Association Node: The "Association" node enables you to identify association relationships within the data.

Variable Selection Node: The "Variable Selection" node enables you to evaluate the importance of input variables in predicting or classifying the target variable. Both Chi-square (tree-based) and R-square selection criteria are available.

• Modify:

Data Set Attributes Node: This node enables one to modify data set attributes, such as data set names, descriptions, and roles.

Transform Variables Node: This node enables one to transform variables. Variable transformation can improve the performance of some data mining techniques such as neural networks and logistic regression. The node supports user-defined formulas for variable transformation, several standard transformations such as the log and square root transformations, and the creation of an ordinal grouping variable from an interval variable (buckets) using a decision-tree-based algorithm.

Filter Outliers Node: This node enables one to identify and remove outliers from data sets.

Replacement Node: This node enables one to impute values for observations that have missing values.

Clustering Node: This node enables one to segment data. Observations that are similar tend to be in the same cluster, and observations that are different tend to be in different clusters. The cluster identifier for each observation can be passed to other nodes for use as an input, ID, or target variable. It can also be passed as a group variable that enables you to automatically construct separate models for each group.

SOM/Kohonen Node: This node can generate self-organizing maps (SOMs), Kohonen networks, and vector quantization networks. It can be used to understand the structure of the data. In addition, it provides a report that indicates the importance of each variable.


• Model:

Regression Node: This node enables one to fit both linear and logistic regression models. The target variable can be continuous, nominal, or binary. The input variables can be either interval (continuous) or classification (discrete) variables. The node supports stepwise, forward, and backward selection methods. It also has an interaction builder to help one create higher-order and interaction terms for the model.

Tree Node: This node enables one to build CART-like trees, CHAID-like trees, and hybrid trees.

Neural Network Node: This node enables you to build, train, and validate multilayer feed-forward neural networks. A neural network is a universal model builder, i.e., it can be used to build generalized linear models as well as non-linear models.

User Defined Model Node: This node allows you to generate assessment statistics using predicted values from a model that you build with the SAS Code node or the Variable Selection node. The predicted values can also be saved to a SAS data set and then imported into the process flow with the Input Data Source node.

Ensemble Node: This node creates a new model by averaging the posterior probabilities (for a discrete target variable) or the predicted values (for a continuous target variable) from multiple models. The new model has the potential to be better than any of the input models.
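To picture what this averaging amounts to, here is a minimal Base SAS sketch; the data set names REG_SCORED, TREE_SCORED, and NN_SCORED, the ID variable IDCODE, and the posterior column P_TARGET_B1 are illustrative assumptions, not names the node produces for you.

   /* Average the posterior probabilities from three scored data sets, */
   /* each assumed to be sorted by IDCODE.                             */
   data ensemble;
      merge reg_scored (rename=(p_target_b1=p_reg))
            tree_scored(rename=(p_target_b1=p_tree))
            nn_scored  (rename=(p_target_b1=p_nn));
      by idcode;
      p_ensemble = mean(p_reg, p_tree, p_nn);   /* averaged posterior probability */
   run;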

• Assess:

Assessment Node: The "Assessment" node provides a common framework for comparing models and predictions from any of the modeling nodes. The comparison is based on the expected and actual profits or losses that would result from implementing the model. The node produces the following charts that help describe the usefulness of the model: lift, profit, return on investment, receiver operating characteristic (ROC), diagnostic, and threshold-based charts.

Score Node: The "Score" node enables you to generate and manage predicted values from a trained model. Scoring formulas are created for both assessment and prediction.

Reporter Node: The "Reporter" node assembles the results from a process flow analysis into an HTML report that can be viewed with your favorite web browser.

In addition, there are some utility nodes to help you organize your data mining flow. The utility nodes include the SAS Code node, Control Point node, Subdiagram node, Data Mining Database node, and Group Processing node.

• Utility Nodes:

Group Processing Node: This node enables the user to perform an analysis for each level of a class variable and to process the same data source repeatedly.


Data Mining Database Node: This node enables the user to create a data-mining database (DMDB) for batch processing.

SAS Code Node: This node can be used to incorporate new or existing SAS code into the process flow diagram.

Control Point Node: This node enables the user to establish a control to reduce the number of connections in the process flow diagram.

Sub-diagram Node: This node enables the user to group a portion of a diagram into a sub-diagram. Sub-diagrams make it much easier to organize a very complicated process flow.

Section 2.3 Sample

Section 2.3.1 Input Data Source node

• Click Select to choose the data set. In this example, choose the library "DATALEC1" and the data set "DONORS".

• This data set has 6,974 observations (rows) and 21 variables (columns).


There is a metadata sample that has 2,000 observations.

Enterprise Miner uses the metadata sample to make a preliminary assessment of how to use each variable. By default, it takes a random sample of 2,000 observations from the entire data set. The default assignments based on the metadata sample are:

• Any numerical variable with more than 10 distinct levels in the metadata sample is assigned the measurement level interval and the model role input.

• Any variable with two non-missing levels in the metadata sample is assigned the measurement level binary and the model role input.

• Any character variable with more than two non-missing levels is assigned the measurement level nominal and the model role input.

• Any numerical variable with more than two and fewer than ten distinct levels in the metadata sample is assigned the measurement level ordinal and the model role input.

• Any variable with only one non-missing level is assigned the measurement level unary and the model role rejected.

• Any variable with a nominal measurement level that has a distinct value for every observation in the metadata sample is assigned the model role id.

The metadata sample never assigns the model role target to any variable; you have to set the model role to target yourself.
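If you want to check these level counts yourself outside the node, a quick Base SAS sketch (assuming a SAS release that supports the NLEVELS option of PROC FREQ) is:

   /* Count distinct, missing, and nonmissing levels for every variable; */
   /* these counts drive the default measurement-level assignments.      */
   proc freq data=datalec1.donors nlevels;
      tables _all_ / noprint;
   run;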

Observe that there are two gray columns, "Name" and "Type". These two columns show information from the SAS data set that cannot be changed in this node. All other information can be changed.

• Inspecting Distributions with the Metadata Sample

You can do some data exploration in this node as well. For example, you can view the distribution of TARGET_B.


Select the "Interval Variables" tab to view the descriptive statistics for interval variables.

Select the "Class Variables" tab to view the descriptive statistics for class variables such as PETS and PCOWNERS.


The main goal of inspecting variables in the Input Data Source node is to specify the correct model role and measurement level. If you want to explore the variables further, you can use exploration nodes such as the Insight node.

• Modifying Model Roles and Measurement Levels

Modify the model role and measurement levels for PETS and PCOWNERS: set the model role for both variables to input and the measurement level to binary.

Set the model role for TARGET_B to target and the model role for TARGET_D to rejected.


• Using the Target Profile

When building predictive models, the "best" model often varies according to the criteria used for evaluation or comparison. One commonly used criterion is to choose the model that most accurately predicts the response. However, this criterion might not be suitable in many cases. The recommended criterion is to choose the model that generates the highest expected profit based on a user-defined profit matrix.

Specify the Profit Matrix: For example, the cost of sending someone a mailing is $0.68 and the median donation from a donor is $13.00. Therefore, mailing to someone who would not respond costs $0.68, while failing to mail to someone who would have responded costs $12.32 ($13.00 - $0.68) in lost net revenue.

Specify the Prior Probabilities: In addition, it is very important to have balanced data in each category of the target variable when building a "good" predictive model. Typically, one oversamples the category that makes up less than 50% of the population. For example, the response rate in the population is 5%, but the data used to build the predictive model have about a 50% response rate. Thus, we must specify the prior probabilities in the target profile to reflect this difference.
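The arithmetic behind these two settings can be sketched in Base SAS. The data set WORK.SCORED and the posterior variable P_TARGET_B1 below are assumed names used only for illustration; the sketch adjusts the posterior from the roughly 50/50 modeling sample back to the 5% population prior and then mails only when the expected profit of mailing is positive.

   /* Prior adjustment and expected-profit decision (a sketch, not the */
   /* Enterprise Miner internals).                                     */
   data decisions;
      set work.scored;
      /* adjust the posterior: sample priors 0.5/0.5, population priors 0.05/0.95 */
      p_adj = (p_target_b1 * 0.05/0.5) /
              (p_target_b1 * 0.05/0.5 + (1 - p_target_b1) * 0.95/0.5);
      /* profit matrix: mail a responder = $12.32, mail a non-responder = -$0.68 */
      exp_profit = p_adj*12.32 + (1 - p_adj)*(-0.68);
      mail = (exp_profit > 0);   /* equivalent to p_adj > 0.68/13.00, about 0.052 */
   run;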

Set the Target Profile in the Input Data Source Node:

• Position the mouse over the row for TARGET_B and right-click.
• Select Edit Target Profile.
• Select Yes if a "no target profile found" message appears.
• Enter "My Profile" as the description for the new profile (currently called Profile).
• Position the mouse over the newly created profile, right-click, and select Set to use if it is not already in use.
• Select the Target tab to see the information about the target variable, TARGET_B.
• Select the Levels subtab to view the frequency information for the target variable.
• Select the Assessment Information tab to add the profit matrix:

1. Right-click in the open area where the vectors and matrices are listed and select Add.
2. A new profit matrix is created.
3. Type My matrix in the name field and press the Enter key.
4. Type 12.32 in the (1,1) cell, -0.68 in the (0,1) cell, and 0 in the (0,0) cell to reflect the profits discussed above.
5. Right-click on My matrix and select Set to use.
6. You can select the Edit Decisions subtab if you want to change the default from maximizing profit to the maximize profit with costs or minimize loss options.


Select the Prior tab to modify the prior probabilities:

1. Right-click in the open area where the prior profiles are listed and select Add to add a new prior vector.
2. Highlight the new prior profile by selecting Prior vector.
3. Modify the prior vector to represent the true proportions in the population (0.05 when the target value is 1 and 0.95 when the target value is 0).
4. Right-click the Prior vector and select Set to use.

Section 2.3.2 Data Partition node

Data partitioning is necessary when building a predictive model because you need to reserve some data to honestly evaluate the performance of the model. Since some model-building tools, such as neural networks and decision trees, need an additional data set to monitor and tune the model during estimation, an additional subset may be needed. By default, Enterprise Miner uses simple random sampling to divide the input data set into training, validation, and test data sets. However, it also supports stratified random sampling and user-defined sampling. In addition, you can use either the Clustering or the SOM/Kohonen node before the Data Partition node to perform cluster sampling.

In the Data Partition node, you can also select the seed for initializing the sampling process. If you use the same data with the same seed in different flows, you will get the same partition. In order to get the same results as the instructor, we will not change the seed here. You can also select the percentage of the data allocated to the training, validation, and test data sets. Since the data used in this example have only 6,974 observations, you will split this data set into two parts: one for training (50%) and one for validation (50%).
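Conceptually, the default simple random split amounts to something like the following Base SAS sketch; the seed 12345 and the output names TRAIN and VALID are assumptions for illustration (the node writes its own TRN***** and VAL***** data sets).

   /* 50/50 simple random partition of the input data set */
   data train valid;
      set datalec1.donors;
      if ranuni(12345) < 0.5 then output train;
      else output valid;
   run;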

Section 2.4 Data Exploration with the Insight Node

By default, the Insight node selects a random sample of 2,000 observations from the training data set. In SAS Enterprise Miner, the name of the training data set created by the Data Partition node has the prefix TRN followed by five random alphanumeric characters. Similarly, the name of the validation data set has the prefix VAL followed by five random alphanumeric characters.


In this example, you might want to use the entire training data set for this preliminary examination because the training data set has only 3,487 observations. Select the radio button next to Entire data set to run Insight on the entire training data set.

After running the Insight node, you can look at the distribution of each variable as follows:

1. Select Analyze -> Distribution (Y).
2. Highlight all the variables except IDCODE in the variable list.
3. Select Y.
4. Select IDCODE.
5. Select Label.
6. Select OK.
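If you prefer to do the same univariate exploration outside Enterprise Miner, a rough Base SAS sketch is shown below; WORK.TRAIN stands in for the TRN***** data set created by the Data Partition node.

   /* Histograms and descriptive statistics for the interval variables */
   proc univariate data=work.train;
      var age income malemili malevet localgov stategov fedgov
          cardprom numprom cardgift timelag avggift lastt firstt;
      histogram;
   run;

   /* Frequency tables (including missing values) for the class variables */
   proc freq data=work.train;
      tables homeownr gender pets pcowners target_b / missing;
   run;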


Charts for continuous variables include histograms, box and whisker plots, and assorted descriptive statistics. Charts for categorical variables include histograms. Some general comments about these distributions are as follows:


1. AGE: roughly symmetric; no transformation is needed.
2. HOMEOWNR: There are three categories, "H" (homeowner), "U" (unknown), and missing. Since missing and unknown provide the same information, we can combine these two categories.
3. INCOME: INCOME should be treated as a numerical variable, i.e., its measurement level should be set to interval.
4. GENDER: There are more females than males in the training data set. The missing level for GENDER should be recoded to FEMALE.
5. MALEMILI: MALEMILI is a very skewed numerical variable. It might be better to bin it into three or four bins.
6. MALEVET: This variable is symmetric except for a spike close to 0.
7. LOCALGOV, STATEGOV, and FEDGOV: All three variables are skewed to the right, and a log transformation is appropriate.
8. PETS and PCOWNERS: The missing values for both variables should be recoded to "U" (unknown).
9. CARDPROM and NUMPROM: Both variables look fine.
10. CARDGIFT and TIMELAG: Both are skewed to the right; a log transformation is appropriate.
11. AVGGIFT: This variable is very skewed. The best way to handle it is to use bins.
12. LASTT: It is okay.
13. FIRSTT: It has an extremely large value that can be deleted.
14. TARGET_B and TARGET_D: They do not matter at this point.

Section 2.5 Modify

Data exploration gives you an opportunity to find problems in the data, such as missing values, skewness, and outliers. In this example, MALEMILI, LOCALGOV, STATEGOV, FEDGOV, CARDGIFT, TIMELAG, and AVGGIFT are very skewed. Before building any predictive model, you can address this with the Transform Variables node.

Section 2.5.1 Modify with the Transform Variables Node

First, look at the distribution of the variable AVGGIFT. Since the distribution is very skewed, you decide to use some type of binning method to perform the necessary transformation. Three binning methods are available in SAS Enterprise Miner:

• Bucket: creates cutoffs at approximately equally spaced intervals.

• Quantile: creates bins with approximately equal frequencies.

• Optimal Binning for Relationship with Target: this transformation uses the DMSPLIT procedure to optimally split a variable into n groups with respect to a binary target. It is useful when there is a nonlinear relationship between the input variable and the binary target. An ordinal measurement level is assigned to the transformed variable. To create the n optimal groups, the node applies a recursive process of splitting the variable into groups that maximize the association with the target values. To determine the optimal groups and to speed processing, the node uses the metadata sample as input.

To create the first split (grouping), the node divides the variable values into 64 equally spaced bins. The 64 bins are then split into 2 groups at the point that maximizes the Chi-square value in a Chi-square test of independence. The binary target values form the rows of the 2x2 table and the group values form the columns. For example, the maximized Chi-square value might be obtained by dividing the 64 bins into one group that contains 20 bins and another group that contains 44 bins.

Example contingency table for the Chi-square test:

Binary Target   Group 1 (20 bins)   Group 2 (44 bins)
Yes             180                 40
No              84                  200

The split (grouping) is made if the Chi-square value exceeds a cutoff value of 3.84. If the Chi-square value for the initial split does not exceed the cutoff value, then the input variable is not transformed and the process stops.
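To see the test concretely, the example table above can be reproduced with PROC FREQ; the Chi-square statistic for these counts is roughly 136, far above the 3.84 cutoff, so the split would be made.

   /* Chi-square test of independence on the example 2x2 table          */
   /* (counts entered directly; this illustrates the test, not the      */
   /* DMSPLIT procedure itself).                                        */
   data example;
      input target $ grp $ count;
      datalines;
   Yes Group1 180
   Yes Group2  40
   No  Group1  84
   No  Group2 200
   ;
   run;

   proc freq data=example;
      tables target*grp / chisq;
      weight count;
   run;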

After a binning transformation, a new variable with an ordinal measurement level is created to replace the original variable. First, use the bucket binning method to bin the variable MALEMILI. Suppose the preferred choice is to bin the values into the intervals 0-0, 0-4, and 4+. You can take the following steps to do so:

1. Position the mouse over the row for MALEMILI.
2. Right-click and choose Transform -> Bucket.
3. Change the number of buckets to 3.
4. Select Close.
5. Enter 0 in the value field for Bin 1.
6. Enter 4 in the value field for Bin 2.
7. Close the plot to return to the previous window.

The distribution for the transformed variable looks much better.
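In Base SAS the same bucketing can be sketched as follows (WORK.TRAIN is again an assumed name for the training data set):

   /* Bucket MALEMILI into three bins with cutoffs at 0 and 4 */
   data train_binned;
      set work.train;
      if      malemili = .  then malemili_bin = .;
      else if malemili <= 0 then malemili_bin = 1;
      else if malemili <= 4 then malemili_bin = 2;
      else                       malemili_bin = 3;
   run;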


Next, you need to perform log transformations on LOCALGOV, STATEGOV, FEDGOV, CARDGIFT, and TIMELAG because the distributions of all these variables are skewed to the right. Finally, you can use the optimal binning method to bin the variable AVGGIFT. The initial number of bins is set to 4, and the node finds the optimal number of bins (three).
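A Base SAS sketch of these log transformations is shown below; log(x + 1) is used here as a simple guard against zero values, and the output name TRAIN_TRANS is an assumption (the Transform Variables node names its outputs differently).

   /* Log-transform the right-skewed variables */
   data train_trans;
      set train_binned;
      log_localgov = log(localgov + 1);
      log_stategov = log(stategov + 1);
      log_fedgov   = log(fedgov   + 1);
      log_cardgift = log(cardgift + 1);
      log_timelag  = log(timelag  + 1);
   run;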


Section 2.5.2 Modify with the Replacement Node

The Replacement node allows you to perform missing value imputation. This replacement is necessary in order to utilize all of the observations in the training data when building a regression or neural network model; otherwise, any observation with a missing value on any of the input variables is ignored when such a model is built. We will create imputed-value indicator variables for all variables with missing values before performing the imputation; otherwise, we will not be able to study the missing value pattern. To do so, check the box for Create imputed indicator variables and use the arrow to change the Role field to input as shown below. Each missing value indicator variable has the prefix "M_" and has the value 1 when an observation has a missing value for the associated variable and 0 otherwise.
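The indicator variables themselves are simple to picture; here is a Base SAS sketch for two of the variables with missing values (TRAIN_TRANS and TRAIN_IND are assumed names).

   /* 1 if the value is missing, 0 otherwise (the node prefixes these with M_) */
   data train_ind;
      set train_trans;
      m_age     = (age = .);
      m_timelag = (timelag = .);
   run;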


Since we want to use the entire training data set for data imputation, we need to follow these steps to change the default setting:

1. Select the Data tab.
2. Select the Training subtab under the Data tab.
3. Select the radio button next to Entire data set.
4. Return to the Defaults tab.

Now we can apply an imputation method to impute the missing values. SAS provides the following imputation methods for interval variables:

1. Mean (default): the arithmetic average.
2. Median: the 50th percentile.
3. Midrange: the maximum plus the minimum, divided by two.
4. Distribution-based: replacement values are calculated based on random percentiles of the variable's distribution.
5. Tree imputation: replacement values are estimated with a decision tree that uses the remaining input and rejected variables with a status of use as predictors.
6. Tree imputation with surrogates: the same as above, except that surrogate variables are used for splitting whenever a split variable has a missing value.
7. Mid-min spacing.
8. Tukey's biweight, Huber's, and Andrew's wave: robust M-estimators of location.
9. Default constant: a default constant that you set for some or all variables.
10. None: turn off imputation for all interval variables.

For class variables, SAS provides six imputation methods: most frequent value, distribution-based, tree imputation, tree imputation with surrogates, default constant, and none. In this example, we use tree imputation as the default imputation method for all variables. To do so, click the Imputation Methods subtab under the Defaults tab and change the default method to tree imputation for both interval variables and class variables. After you make these changes, the Defaults tab shows tree imputation for both variable types.


Suppose we want to change the imputation method for AGE to the mean and for CARDPROM to the fixed constant 20. We can do so as follows:

1. Select the Interval Variables tab.
2. Position the mouse on the row for AGE in the Imputation Method column and right-click.
3. Select Select Method -> Mean.
4. Position the mouse on the row for CARDPROM in the Imputation Method column and right-click.
5. Select Select Method -> Set Value.
6. Type 20 in the New value field.
7. Specify None as the imputation method for TARGET_D.
8. Select OK.

When you complete these steps, the Interval Variables tab shows the updated methods.
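Outside the node, the same two choices could be sketched in Base SAS; PROC STDIZE with REPONLY replaces only the missing values, and this is plain mean imputation rather than the node's tree imputation.

   /* Mean imputation for AGE */
   proc stdize data=train_ind out=train_imp reponly method=mean;
      var age;
   run;

   /* Fixed-constant imputation for CARDPROM */
   data train_imp;
      set train_imp;
      if cardprom = . then cardprom = 20;
   run;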


Suppose we want to set the missing values for HOMEOWNR to "U" and use the default constant "U" for PCOWNERS and PETS. We need to do the following:

1. Select the Constant values subtab under the Defaults tab.
2. Enter U in the field for character variables.
3. Select the Class Variables tab.
4. Right-click on the row for HOMEOWNR in the Imputation Method column.
5. Select Select Method -> Set Value.
6. Select the radio button next to Data Value.
7. Use the arrow to choose "U" from the list of data values.
8. Right-click on the row for PETS in the Imputation Method column.
9. Select Select Method -> Default constant.
10. Right-click on the row for PCOWNERS in the Imputation Method column.
11. Select Select Method -> Default constant.
12. Right-click on the row for TARGET_B in the Imputation Method column.
13. Specify None as the imputation method for TARGET_B.
14. Select OK.

When you complete these steps, the Class Variables tab shows the updated methods.
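The class-variable imputation with the constant "U" amounts to the following Base SAS sketch (assumed data set names again):

   /* Replace missing character values with "U" (unknown) */
   data train_imp2;
      set train_imp;
      if homeownr = ' ' then homeownr = 'U';
      if pets     = ' ' then pets     = 'U';
      if pcowners = ' ' then pcowners = 'U';
   run;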


Note: Sometimes tree imputation fails for unknown reasons. If tree imputation fails, you can change the imputation method to median for interval variables and to most frequent value for class variables. In this example, I found that FIRSTT, MALEVET, STAT_XXX (which replaces STATEGOV), TIME_XXX (which replaces TIMELAG), and AVGG_XXX (which replaces AVGGIFT) could not use tree imputation, nor tree imputation with surrogates.


Section 2.6 Fitting Candidate Models

In this section, we fit three models. The first model is a logistic regression model. After opening the Regression node, you can select the Selection Method tab to choose a criterion for selecting variables. You can choose from the following variable selection techniques:

1. Backward: begins with all candidate effects in the model and then systematically removes effects that are not significantly associated with the target variable until every effect remaining in the model meets the Stay significance level or until the Stop criterion is met.

2. Forward: begins with no effects in the model and then systematically adds effects that are significantly associated with the target variable until none of the remaining effects meet the Entry significance level or until the Stop criterion is met.

3. Stepwise: begins, by default, with no candidate effects in the model and then systematically adds effects that are significantly associated with the target. However, after an effect is added, stepwise selection may remove any effect already in the model that is no longer significantly associated with the target variable.

4. None (default): all candidate effects are included in the final model.

In this example, we use the Stepwise variable selection method. We set the significance levels for both Entry and Stay to 0.05 and leave all other options untouched. When you close the Regression node, you can change the model name to "STEPREG1".
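A rough Base SAS analogue of STEPREG1 is sketched below. The input list is illustrative only; in the actual flow it consists of the transformed and imputed variables passed on by the earlier nodes, and the names used here are the assumed ones from the sketches above.

   /* Stepwise logistic regression with entry/stay significance 0.05 */
   proc logistic data=train_imp2 descending;
      class homeownr gender pets pcowners;
      model target_b = age income malemili_bin malevet log_localgov
                       log_stategov log_fedgov cardprom numprom
                       log_cardgift log_timelag lastt firstt
                       / selection=stepwise slentry=0.05 slstay=0.05;
   run;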

We can then add a default Tree node to fit a tree model and a default Neural Network node to fit a neural network model.


Section 2.7 Assessing Candidate Models

Add an Assessment node and select Run. When the computation is complete, you are prompted to view the Assessment results. First, we can look at the Cumulative %Response chart for the regression model. To understand this chart, you need to know how it is constructed:

1. For this example, a responder is defined as someone who makes a donation (TARGET_B = 1). For each person, the fitted model predicts the probability that the person will donate. Sort the observations by the predicted probability of response, from highest to lowest.
2. Group the people into ordered bins (deciles), each containing about ten percent of the data.
3. Using the target variable TARGET_B, count the percentage of actual responders in each bin.

If the model is useful, the proportion of responders will be relatively high in bins with high predicted probabilities. For this example, the percentage of responders in the first decile is 9.88%, which is higher than the 5% population response rate. In fact, the lift values for the first four deciles are greater than 1. (Note: lift value = the actual response rate in a bin divided by the population response rate.) We can also use the Cumulative %Captured Response chart to answer questions such as "What percentage of the total number of responders is captured in the top bins?" In this example, you can capture about 33% of the responders by mailing to just the top twenty percent of the observations.
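The chart's construction can be reproduced in Base SAS with a sketch like the one below; VALID_SCORED and P_TARGET_B1 are assumed names for the scored validation data set and the posterior probability of TARGET_B = 1.

   /* Rank observations into deciles of predicted probability */
   proc rank data=valid_scored out=ranked groups=10 descending;
      var p_target_b1;
      ranks decile;                 /* 0 = highest predicted probability */
   run;

   /* Response rate (mean of the 0/1 target) within each decile */
   proc means data=ranked mean n;
      class decile;
      var target_b;
   run;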


Select the View Lift Data icon on the toolbar and inspect the resulting table.


You can compare the lift charts for all three models. Based on the lift chart, the regression model is much better than either the tree model or the neural network model.

After building the models, you can either export the scoring code to score a new data set with Base SAS or score the new data set within Enterprise Miner. There are four Settings options available in the Score node:

1. Inactive (default): exports the most recently created scored data sets.
2. Apply training data score code to score data set: allows you to score a data set within Enterprise Miner.
3. Accumulate data sets by type.
4. Merge data sets by type: merges data sets imported from predecessor nodes. For example, you can use this option to merge the training data sets from all three models to compare their predicted values.

To see the score code, click the Score code tab and then double-click the desired model. If you want to save the scoring code for the regression model, follow these steps:

1. Right-click the selected model (the regression model) and select Save.
2. Enter a name such as My Regression Model Version I in the name field and click OK.
3. The code is now saved inside Enterprise Miner.

To use the code outside of Enterprise Miner in a SAS session, you need to export the scoring code from Enterprise Miner as follows:

1. Highlight the name representing the desired code in the list.
2. Right-click the highlighted name and select Export.
3. Enter a name for the saved program, such as ScoreCodeLecture1, and select Save.

You can then use the saved scoring code to score a new data set in Base SAS. To do so, proceed as follows:

1. Select Window -> Program Editor.
2. Select File -> Open.
3. Select ScoreCodeLecture1.
4. Add the two SAS statements "%let _Score=STA5703.myscore;" and "%let _predict=X;" at the beginning of the program in order to score the data set myscore in the STA5703 library.
5. Add the following statements at the end of the file to print the output:

   PROC PRINT DATA=&_PREDICT;
      VAR IDCODE P_TARGET_B1;
   RUN;

6. Submit the program; the result should look like the following table. In this case, you can send a mailing to anyone whose predicted probability is greater than the cutoff of 0.055; about 41% (5255/12765) of all candidates would then receive a mailing.

Obs    IDCODE    P_TARGET_B1
7498   58384     0.054913
7499   56597     0.054928
7500   45280     0.054935
7501   61524     0.054937
7502   42932     0.054939
7503   46355     0.054940
7504   52330     0.054942
7505   44894     0.054945
7506   47132     0.054965
7507   57854     0.054969
7508   45334     0.054987
7509   61307     0.054991
7510   45788     0.054992
7511   55601     0.055004
7512   47473     0.055005
7513   57435     0.055005
7514   50771     0.055009
7515   56584     0.055009
7516   47051     0.055010
7517   48856     0.055010
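Once the cutoff is chosen, selecting the mailing list from the scored data is a short sketch (using the &_PREDICT data set created by the score code above and the cutoff discussed in step 6):

   /* Keep only the candidates above the chosen cutoff */
   data mail_list;
      set &_predict;
      if p_target_b1 > 0.055;
   run;

   proc print data=mail_list;
      var idcode p_target_b1;
   run;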


To score the new data set within Enterprise Miner instead, add an Input Data Source node for the score data:

1. Select the data set "myscore" in the "STA5703" library.
2. Change the role of the data set to Score.
3. Open the Score node and select the radio button next to Apply training data score code to score data set.
4. Select the Data tab and then select the Select button to see a list of predecessors.
5. Use Browse to choose the data set associated with the Regression node.
6. Select OK to accept this selection.
7. Add another Insight node.
8. Open the Insight node.
9. Select the radio button next to Entire data set in order to use the whole data set.
10. Choose the Select button on the Data tab to select the data set associated with the scored data. (This data set has the prefix SD.)
11. Run the Insight node and view the result.

You can now use the Reporter node to create an HTML report for this process. The final diagram shows the complete process flow for this lecture.


Appendix 2.1 Data Used in Lecture 2

The data are from a nonprofit organization that relies on fundraising campaigns to support its efforts. After analyzing the original data, a subset of 19 predictor variables was selected to model the response to a mailing. Two response variables are stored in the data set DONORS: one indicates whether or not someone responded to the mailing (TARGET_B), and the other measures how much the person actually donated in U.S. dollars (TARGET_D). The output from PROC CONTENTS is as follows:

The CONTENTS Procedure

Data Set Name:  DATALEC1.DONORS                Observations:          6974
Member Type:    DATA                           Variables:             21
Engine:         V6                             Indexes:               0
Created:        21:07 Monday, July 2, 2001     Observation Length:    173
Last Modified:  21:07 Monday, July 2, 2001     Deleted Observations:  0
Protection:                                    Compressed:            NO
Data Set Type:                                 Sorted:                NO
Label:

-----Engine/Host Dependent Information-----
Data Set Page Size:         16384
Number of Data Set Pages:   75
First Data Page:            1
Max Obs per Page:           94
Obs in First Data Page:     77
Number of Data Set Repairs: 0
File Name:                  C:\Morgan\LectureData\donors.sd2
Release Created:            6.08.00
Host Created:               WIN


-----Alphabetic List of Variables and Attributes-----

 #  Variable   Type  Len  Pos  Label
 1  AGE        Num   8    0    Donor's Age
16  AVGGIFT    Num   8    120  Average Gift Size
14  CARDGIFT   Num   8    104  # Donated after Card Promotions
12  CARDPROM   Num   8    88   Number of Card Promotions
 9  FEDGOV     Num   8    64   % of Household in Federal Government
20  FIRSTT     Num   8    152  Elapsed Time since First Donation
 4  GENDER     Char  8    24   F=Female; M=Male
 2  HOMEOWNR   Char  8    8    H=Homeowner; U=Unknown
21  IDCODE     Char  13   160  ID Code; Unique for Each Donor
 3  INCOME     Num   8    16   Income Level; Classified into 0 to 9
19  LASTT      Num   8    144  Elapsed Time since Last Donation
 7  LOCALGOV   Num   8    48   % of Household in Local Government
 5  MALEMILI   Num   8    32   % of Household Males Active in the Military
 6  MALEVET    Num   8    40   % of Household Male Veterans
13  NUMPROM    Num   8    96   Total Number of Promotions
11  PCOWNERS   Char  8    80   Y=Donor Owns Computer; Missing=Otherwise
10  PETS       Char  8    72   Y=Donor Owns Pet; Missing=Otherwise
 8  STATEGOV   Num   8    56   % of Household in State Government
17  TARGET_B   Num   8    128  1=Donor; 0=Not Contribute
18  TARGET_D   Num   8    136  Dollar Amount of Contribution
15  TIMELAG    Num   8    112  Time between First and Second Donation

Appendix 2.2 References for Lecture 2

SAS Institute (2000), Data Mining Using Enterprise Miner Software: A Case Study Approach, First Edition, SAS Institute, Cary, NC.

SAS Institute (2000), Enterprise Miner: Applying Data Mining Techniques Course Notes, SAS Institute, Cary, NC.

SAS Institute (2000), Getting Started with Enterprise Miner Software, Release 4.1, SAS Institute, Cary, NC.