data mining using sas

Group Assignment Data Mining MKTG 5963

Abhinav Garg (11761380)

Tanu Srivastav (11772446)

Tejbeer Chhabra (11756746)

Maunik Desai (11758140)

Maanasa Nagaraja (11678486)

Table of Contents

Executive Summary
Data Audit
Modeling
Model Comparison
Scoring
Segmentation
Conclusion
Appendix A: Data Exploration
Appendix B: Clustering
Appendix C: Data Modeling
Appendix D: Model Comparison
Appendix E: Scored Data

Contents for Table

Table 1 Variable Worth in Clusters
Table 2 Sensitivity and Specificity for Forward Regression Model
Table 3 Sensitivity and Specificity for Stepwise Regression
Table 4 Sensitivity and Specificity for Neural Network
Table 5 Model Comparisons
Table 6 Scored Data Summary for Target Variable


Executive Summary

Diversity and SAT scores play an important role in creating a better learning environment and a good college experience for students. Diversity enriches the educational experience and promotes personal growth, and the SAT score is a useful predictor of college academic performance.

In our analysis, we aim to identify the prospective students who are most likely to enroll as new freshmen in Fall 2005. We also outline a marketing strategy the administration can use to increase diversity and SAT scores.

Data Audit

Before performing data modeling, it is critical to explore the data to find interesting insights about it.

1. DMDB Node

The DMDB node gave us a quick understanding of our data in the form of summary statistics for the numerical variables, the number of categories for the class variables, and the extent of missing values in the data. From the results, it is apparent that the categorical variables have no missing values, whereas the interval variables do. In addition, distance, hscrat, init_span, int1rat and int2rat exhibit non-normal behavior, which can introduce bias.
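The kind of summary described above can be sketched in a few lines of Python. This is only an illustration, not the DMDB node itself: the function name `summarize` and the sample values are hypothetical, and the moment-based skewness and kurtosis formulas used here differ slightly from the bias-corrected versions that SAS reports.

```python
import math

def summarize(values):
    """Summarize one interval variable; None marks a missing value."""
    observed = [v for v in values if v is not None]
    n = len(observed)
    mean = sum(observed) / n
    # Central moments of the observed values
    m2 = sum((v - mean) ** 2 for v in observed) / n
    m3 = sum((v - mean) ** 3 for v in observed) / n
    m4 = sum((v - mean) ** 4 for v in observed) / n
    return {
        "missing": len(values) - n,
        "std": math.sqrt(m2 * n / (n - 1)),   # sample standard deviation
        "skewness": m3 / m2 ** 1.5,           # 0 for a symmetric variable
        "kurtosis": m4 / m2 ** 2 - 3.0,       # excess kurtosis, 0 for a normal
    }

# Hypothetical distances with one missing entry and one extreme value
distance = [12.0, 30.5, None, 45.0, 800.0, 22.0]
report = summarize(distance)
```

A large positive skewness in such a report is the signal that a variable like DISTANCE is far from normal.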

2. Data Reduction

The variables ACADEMIC_INTEREST_1 and ACADEMIC_INTEREST_2 have counterparts in INT1RAT and INT2RAT, respectively. Similarly, IRSCHOOL was converted into HSCRAT. TELECQ had more than 50% missing values. TOTAL_CONTACTS is simply the sum of the other contact counts. CONTACT_CODE1 has hundreds of levels, and such a code provides little information. For these reasons, ACADEMIC_INTEREST_1, ACADEMIC_INTEREST_2, IRSCHOOL, TELECQ, TOTAL_CONTACTS, and CONTACT_CODE1 were all removed from the data set.

3. Missing Value Imputation

Because our interval variables had many missing values, we used the PROC MI procedure to impute them rather than traditional imputation methods, which can introduce unknown biases into the data. PROC MI supports both examining the patterns of missing data and imputing values. It generates multiple complete data sets from the original data by repeatedly replacing the missing entries with imputed ones.
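The idea of generating several completed data sets can be illustrated with a deliberately simplified Python sketch. It is not the algorithm PROC MI uses: here each missing entry is replaced by a random draw from a normal distribution fitted to the observed values, and the function names, the seed, and the SAT values are all hypothetical.

```python
import random
import statistics

def impute_once(values, rng):
    """Return one completed copy of values; None marks a missing entry."""
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    # Observed entries are kept; missing ones get a random plausible value
    return [v if v is not None else rng.gauss(mu, sigma) for v in values]

def multiple_imputation(values, m=5, seed=0):
    """Generate m completed data sets from the same incomplete variable."""
    rng = random.Random(seed)
    return [impute_once(values, rng) for _ in range(m)]

# Hypothetical SAT scores with two missing entries
sat_scores = [1100, None, 1250, 980, None, 1170, 1040]
completed = multiple_imputation(sat_scores, m=5)

# Pool an estimate (here the mean) across the completed data sets
pooled_mean = statistics.mean(statistics.mean(d) for d in completed)
```

Because the imputed values differ across the completed data sets, the spread of an estimate across them reflects the uncertainty caused by the missing data.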

4. Data Filter Node

Extreme values are problematic because they may have undue influence on the model. We handled them by excluding observations that contain outliers or other extreme values that we do not want in our model. This also reduces skewness and brings the variables closer to a normal distribution. The filtering methods used are Standard Deviations from the Mean for the interval variables and Rare Values (Percentage) for the class variables.
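The standard-deviation filter for interval variables can be sketched as follows. This is a minimal illustration, assuming a cutoff of 3 standard deviations; the report does not state the cutoff that was used, and the function name and sample data are hypothetical.

```python
import statistics

def filter_by_std(values, n_std=3.0):
    """Keep only the values within n_std standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - n_std * sd, mean + n_std * sd
    return [v for v in values if lower <= v <= upper]

# Hypothetical distances: 19 ordinary values and one extreme value
distances = list(range(10, 29)) + [2000]
kept = filter_by_std(distances)
```

In this example only the value 2000 falls outside the limits, so the other 19 observations are kept.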

5. Data Partitioning

Before building our models, we split the data into training (70%) and validation (30%) sets. We chose a 70–30 split because it strikes a good balance between the data available for training and the data held out for honest assessment. The 70–30 split also preserved a proportion of the target similar to that in the original data set. A summary of the split is provided in the appendix.
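A split that preserves the target proportion can be sketched in Python as a stratified partition. This is an illustration only: the function name, the seed, and the toy rows are hypothetical, and the Data Partition node may stratify differently.

```python
import random

def stratified_split(rows, target_index, train_fraction=0.7, seed=42):
    """Split rows into training and validation sets within each target class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target_index], []).append(row)
    train, valid = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's data is untouched
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

# Hypothetical rows of (id, enroll) with a balanced target
rows = [(i, i % 2) for i in range(100)]
train, valid = stratified_split(rows, target_index=1)
```

Splitting within each class is what keeps the share of enrollees the same in both partitions.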



6. Data Transformation

Data transformation corrects for skewed distributions of the numerical input variables and for large numbers of classes in the categorical variables. Based on the skewness and kurtosis values obtained after filtering, the independent variables are approximately normally distributed. We nevertheless applied the "Maximum Normal" transformation to the independent variables, one of the best power transformation techniques in the Box-Cox family, to analyze its effectiveness in reducing skewness and kurtosis.

Although the skewness values dropped, the decrease is not large enough to justify the transformation. For instance, HSCRAT has a skewness of 2.64, and after the log transformation suggested by Maximum Normal the skewness is still 1.9. Moreover, transformations bring their own challenges. Transformed variables come at a cost, since they are harder to interpret (log, square root), especially in a business scenario. Therefore, we chose not to perform any transformation.
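The effect described above, where a log transformation lowers skewness without removing it, can be reproduced on toy data. The sketch below is an illustration under assumptions: the ratios are made up, the transformation is taken to be log(x + 1), and skewness is computed from plain moments rather than with the bias-corrected formula SAS uses.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over m2 to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed enrollment ratios
ratios = [0.01, 0.02, 0.02, 0.03, 0.05, 0.08, 0.10, 0.40, 0.90]
logged = [math.log(r + 1.0) for r in ratios]   # assumed form of the log transform

skew_before = skewness(ratios)
skew_after = skewness(logged)
```

On this sample the skewness drops only slightly and stays clearly positive, which mirrors the trade-off weighed in the text: a modest gain in symmetry against a less interpretable variable.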

Modeling

We used the Decision Tree, Forward and Stepwise Regression, Neural Network, and Auto Neural modeling techniques.

1. Decision Trees

Decision tree methodology is a commonly used data mining method for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable. A split search algorithm facilitates input selection, and model complexity is addressed by pruning. The settings used for the Decision Tree node are Maximum Branch = 2, Maximum Depth = 6 and Minimum Leaf Size = 5, with Assessment as the subtree method and Misclassification as the assessment measure.
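To make the split search concrete, here is a minimal Python sketch that finds the best binary split on a single interval input. It uses Gini impurity as an illustrative criterion; the report does not state which splitting criterion the node used, and the function names and toy data are hypothetical. The `min_leaf` argument plays the role of the Minimum Leaf Size setting.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_split(xs, ys, min_leaf=5):
    """Return (threshold, weighted Gini) of the best binary split, or None."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = None
    for i in range(min_leaf, n - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                  # cannot split between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: number of self-initiated contacts and enrollment (0/1)
contacts = [0, 0, 1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8]
enroll = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
split = best_split(contacts, enroll)
```

A tree repeats this search over every input at every node, which is why inputs that never yield a useful split drop out of the model.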

Below is the variable importance report:

Variable Name       Importance
SELF_INIT_CNTCTS    1.0000
HSCRAT              0.3798
STUEMAIL            0.2767
INIT_SPAN           0.1404
MAILQ               0.0816
INTEREST            0.0698
INT1RAT             0.0638

Table 1 Variable Worth in Clusters

SELF_INIT_CNTCTS was found to be the most important variable in determining the enrollment decision of a prospective student. The minimum misclassification rate was found at Number of Leaves = 12.

Model Assessment: Validation Misclassification Rate = 0.0512 and Training Misclassification Rate = 0.058.

2. Regression

Since our dependent variable ENROLL is a binary categorical variable, the type of regression chosen is logistic.

2.1 Forward Regression

Forward regression creates a sequence of models of increasing complexity. At each step, every variable not already in the model is tested for inclusion. The most significant of these variables is added to the model, as long as its p-value is below the entry threshold SLENTRY = 0.05. The variables selected are SELF_INIT_CNTCTS, STUEMAIL, HSCRAT, INIT_SPAN, DISTANCE, SATSCORE, MAILQ, and INT2RAT. We also studied interaction effects through our regression model. The significant interactions are:

CAMPUS_VISIT * PREMIERE
REFERRAL_CNTCTS * INTEREST
INTEREST * PREMIERE
INSTATE * MAILQ
TRAVEL_INIT_CNTCTS * INTEREST
TERRITORY * STUEMAIL

Model Assessment: Validation Misclassification Rate = 0.0786 and Training Misclassification Rate = 0.073.

Sensitivity    Specificity
91%            93%

Table 2 Sensitivity and Specificity for Forward Regression Model
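For reference, sensitivity, specificity and the misclassification rate can all be derived from the same comparison of actual and predicted classes. The sketch below shows the arithmetic on made-up labels; the function names and data are hypothetical and are not the model's actual predictions.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for 0/1 actual and predicted labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def misclassification_rate(actual, predicted):
    """Share of observations whose predicted class differs from the actual one."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Hypothetical validation labels: 4 enrollees and 6 non-enrollees
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(actual, predicted)
error = misclassification_rate(actual, predicted)
```

Sensitivity is the share of actual enrollees the model identifies, and specificity is the share of non-enrollees it correctly rules out.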

2.2 Stepwise Regression

Stepwise regression combines elements of both the forward and backward selection procedures. The only variable in the final model is SELF_INIT_CNTCTS.

Logit(Enroll = 1) = -2.956 + 1.030 (SELF_INIT_CNTCTS)

With every unit increase in SELF_INIT_CNTCTS, the odds of enrollment are multiplied by a factor of 2.802 (the odds ratio, approximately exp(1.030)), holding everything else constant.
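The arithmetic behind this interpretation can be checked directly from the fitted equation. The sketch below uses the two coefficients reported above; the function name is hypothetical, and the contact counts passed to it are arbitrary examples.

```python
import math

INTERCEPT = -2.956   # intercept reported for the stepwise model
SLOPE = 1.030        # coefficient of SELF_INIT_CNTCTS

def enroll_probability(self_init_cntcts):
    """Predicted probability of enrollment from the fitted logit."""
    logit = INTERCEPT + SLOPE * self_init_cntcts
    return 1.0 / (1.0 + math.exp(-logit))

# Odds ratio for a one-unit increase in SELF_INIT_CNTCTS
odds_ratio = math.exp(SLOPE)

p_no_contact = enroll_probability(0)
p_three_contacts = enroll_probability(3)
```

Exponentiating the rounded coefficient gives roughly 2.80, in line with the reported odds ratio up to rounding. Under this equation the predicted probability is about 5% with no self-initiated contacts and passes 50% at three.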

Model Assessment: Validation misclassification rate = 0.106 and Training misclassification rate = 0.113.


Sensitivity    Specificity
89.7%          88.9%

Table 3 Sensitivity and Specificity for Stepwise Regression

Conclusion: Of the two regression models, forward regression gives the better result based on the validation misclassification rate.

3. Neural Network

In a neural network, the prediction formula is similar to that of a regression, but with a flexible addition in the form of hidden units. A neural network does not easily address input selection. For this reason, we used the variables selected by forward regression as the inputs to the Neural Network model.

3.1 Neural Network

With the default setting of 3 hidden units, the neural network did not meet the convergence criterion, so we reduced the number of hidden units to 2. With 2 hidden units, the convergence criterion was satisfied. This gave a validation misclassification rate of 0.071 and a training misclassification rate of 0.072.


Sensitivity    Specificity
94.77%         91.15%

Table 4 Sensitivity and Specificity for Neural Network

3.2 Auto Neural

The Auto Neural tool offers an automatic way to explore alternative network architectures and neural networks with increasing hidden unit counts. Its output summarizes the training process, showing for each step the fit statistics from the iteration with the smallest validation misclassification. Refer to Appendix C for the results.



Model Comparison

The Model Comparison node helps us evaluate the best model in terms of various fit statistics. It is apparent from the summary of fit statistics (Table 5) that the Decision Tree ranks as the best model, with the lowest training and validation misclassification rates. Its validation misclassification rate is 5.12%. We therefore use this model to score the SCORE_DATA set.

Scoring

Because our modeling data is balanced, the predicted probabilities reflect the target proportion in the training sample rather than in the population. As a result, unadjusted score ranking plots are inaccurate and misleading. To fix this, we adjusted for separate sampling by setting the prior probability of the primary outcome to 3.1%.
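The adjustment for separate sampling follows a standard formula that reweights each predicted probability by the ratio of population prior to sample proportion. The sketch below assumes the balanced sample contains 50% enrollees, which the report implies but does not state; the 3.1% prior comes from the text, and the function name is hypothetical.

```python
def adjust_for_prior(p_sample, sample_prior=0.5, population_prior=0.031):
    """Rescale a probability estimated on a balanced sample to the population."""
    # Weight for the primary outcome (enroll = 1)
    num = p_sample * population_prior / sample_prior
    # Weight for the secondary outcome (enroll = 0)
    den = num + (1.0 - p_sample) * (1.0 - population_prior) / (1.0 - sample_prior)
    return num / den

# A student the balanced model considers a coin flip
adjusted_half = adjust_for_prior(0.5)

# A student the balanced model rates as very likely to enroll
adjusted_high = adjust_for_prior(0.95)
```

A prediction of 0.5 on the balanced sample maps back to the 3.1% base rate, which shows why unadjusted scores would greatly overstate expected enrollment.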

Table 6 Scored Data Summary for Target Variable

Segmentation

The administration is interested in increasing enrollment, diversity and SAT scores. A good way to support this is clustering, which divides the data set into mutually exclusive groups that differ in their diversity. Clustering also lets us identify the group of prospective students with high SAT scores, so that the administration can focus its marketing strategy on that target group. The node chose 3 clusters, and the relative size of each cluster is shown in the pie chart below.

Figure 1 Segmentation Pie Chart

Model                  Validation Misclassification Rate
Decision Tree          0.0512
Auto Neural            0.0663
Neural Network         0.0714
Forward Regression     0.0786
Stepwise Regression    0.1067

Table 5 Model Comparisons



DISTANCE was found to be the most important factor in differentiating the three segments.

Cluster 1: SOLICITED_CNTCTS and INSTATE are the most important predictor variables. Average SAT score = 1143.3. The male-to-female split is 59% to 41%.

Cluster 2: DISTANCE and INSTATE are the most important predictor variables. Average SAT score = 1094.90. The male-to-female split is 60% to 40%.

Cluster 3: INSTATE and INT2RAT are the most important predictor variables. Average SAT score = 1127.00. The male-to-female split is 66% to 34%.

Conclusion

To increase the SAT score of prospective enrollees, the administration should target the students in cluster 1, as they have the highest average score. This group also shows diversity in terms of both ethnicity and sex. Even though Ethnicity C dominates the data, cluster 1 includes people from almost every ethnic group in some proportion. In terms of gender, cluster 1 has the most balanced split of the three clusters, with 59% males and 41% females.

Figure 2 Modeling Diagram

Based on the best-fitting model, the Decision Tree, we found that the most important independent variables are SELF_INIT_CNTCTS and HSCRAT. This suggests that the administration should focus on students who initiated contact themselves and who come from the high schools with the highest enrollment over the last 5 years. If the administration offered these students a special welcome kit or some other welcome offer, it could convert more of them into enrollments.



Appendix A : Data Exploration

Variable              Standard Deviation    Skewness      Kurtosis       Missing Values
Enroll                0.500485              1.10E-16      -2.0007756     0
TOTAL_CONTACTS        3.480081              1.0517156     0.8543522      0
SELF_INIT_CNTCTS      3.0988946             0.8850357     0.305132       0
TRAVEL_INIT_CNTCTS    0.6702278             1.6645745     3.673759       0
SOLICITED_CNTCTS      0.7613853             1.8541231     8.1394438      0
REFERRAL_CNTCTS       0.288625              5.9594486     50.9889296     0
CAMPUS_VISIT          0.3774713             2.3172482     4.4588036      0
SATSCORE              151.4914425           -0.2413784    0.462179       1887
SEX                   0.4860401             -0.4837916    -1.7666479     127
MAILQ                 1.6001673             -0.8884705    -0.9904749     0
TELECQ                0.8074666             0.9573248     0.7499086      3055
PREMIERE              0.4094565             1.4024778     -0.033069      0
INTEREST              0.4118758             2.3363909     5.0932373      0
STUEMAIL              0.4379744             -1.1022225    -0.7854101     0
INIT_SPAN             9.1778057             -2.4740469    84.078123      0
INT1RAT               0.0358866             9.3802148     207.82226      0
INT2RAT               0.039164              7.2199216     111.921867     0
HSCRAT                0.1457441             4.4813438     23.4212246     0
AVG_INCOME            23083.61              0.9640883     0.9048331      763
DISTANCE              370.781848            2.580719      10.6365752     671

Table 7 Data Exploration for Given Data set

Table 8 Skewness and Kurtosis Results after Filtering



Table 9 Filtering Results for Class Variables

Table 10 Filtering Results for Interval Variables

Figure 3 Excluded Observations after Filtering

Figure 4 Partition Summaries



Appendix B: Clustering

Table 11 Ethnicity distributions for each Segment

Table 12 Segmentation Report

1. Segment 1:

Table 13 Variable Worth for Segment 1



Table 14 Worth Plot for Segment 1

2. Segment 2:

Table 15 Variable worth for Segment 2

Figure 5 Worth Plot for Segment 2



3. Segment 3:

Table 16 Variable Worth for Segment 3

Figure 6 Worth Plot for Segment 3

Table 17 Overall variable importances in Clustering



Appendix C: Data Modeling

1. Decision Tree

Figure 7 Subtree Assessment Plot

Table 18 Fit Statistics for Decision Tree



Figure 8 English Rules for Decision Tree

Figure 9 Decision Tree



2. Forward Regression

Table 19 Fit Statistics for Forward Regression

Figure 10 Model Iteration Plot for Forward Regression



3. Stepwise Regression

Table 20 Fit Statistics Report for Stepwise Regression

Figure 11 Model Iteration Plot for Stepwise Regression



Figure 12 Summary of Stepwise Selection

4. Neural Network:

Table 21 Fit Statistics Table for Neural Network

Figure 13 Optimization Summaries for Neural Network



5. AUTO NEURAL

Table 22 Fit Statistics for Auto Neural Network

Appendix D: MODEL COMPARISON

Table 23 Model Comparison Summaries



Figure 14 Sensitivity and Specificity of Models

Appendix E: Scored Data

Table 24 Scored Data Summary for Target Variable
