data mining using sas
Group Assignment Data Mining MKTG 5963
Abhinav Garg (11761380)
Tanu Srivastav (11772446)
Tejbeer Chhabra (11756746)
Maunik Desai (11758140)
Maanasa Nagaraja (11678486)
Table of Contents
Executive Summary
Data Audit
Modeling
Model Comparison
Scoring
Segmentation
Conclusion
Appendix A: Data Exploration
Appendix B: Clustering
Appendix C: Data Modeling
Appendix D: Model Comparison
Appendix E: Scored Data
List of Tables
Table 1 Variable Worth in Clusters
Table 2 Sensitivity and Specificity for Forward Regression Model
Table 3 Sensitivity and Specificity for Stepwise Regression
Table 4 Sensitivity and Specificity for Neural Network
Table 5 Model Comparisons
Table 6 Scored Data Summary for Target Variable
MKTG 5963 Data Mining Group Assignment
Executive Summary
Diversity and SAT scores play an important role in creating a better learning environment and a good college
experience for students. Diversity enriches the educational experience and promotes personal growth, and the SAT
score is a useful predictor of college academic performance.
In our analysis, we aim to identify the prospective students most likely to enroll as new freshmen in Fall 2005.
We also propose a marketing strategy that the administration can use to increase diversity and SAT scores.
Data Audit
Before building models, it is critical to explore the data and understand its quality and distributions.
1. DMDB Node
The DMDB node gave us a quick overview of the data: summary statistics for the numerical variables, the number of
categories for the class variables, and the extent of missing values. The results show that the categorical
variables have no missing values, while several interval variables do. In addition, distance, hscrat, init_span,
int1rat, and int2rat are strongly skewed (non-normal), which can bias the models.
2. Data Reduction
ACADEMIC_INTEREST_1 and ACADEMIC_INTEREST_2 are already represented by INT1RAT and INT2RAT,
respectively, and IRSCHOOL is represented by HSCRAT. TELECQ had more than 50% missing values.
TOTAL_CONTACTS is simply the sum of the other contact counts. CONTACT_CODE1 has
hundreds of levels and provides little usable information. For these reasons,
ACADEMIC_INTEREST_1, ACADEMIC_INTEREST_2, IRSCHOOL, TELECQ, TOTAL_CONTACTS, and CONTACT_CODE1
were all removed from the dataset.
3. Missing Value Imputation
Since our interval variables had many missing values, we used PROC MI to impute them rather than
a simple single-value imputation method, which can introduce bias. PROC MI supports both
examining the patterns of missing data and imputing values. It generates multiple complete
datasets from the original data by repeatedly replacing the missing entries with imputed ones.
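The idea of generating several completed datasets can be illustrated with a small Python sketch. This is not PROC MI's algorithm (no MCMC or regression model is fitted); it is a simplified stand-in that fills each missing entry of a single variable with a random draw from a normal distribution fitted to the observed values. The function name impute_copies, the sample values, and the choice of five copies are hypothetical.

```python
import random
import statistics

def impute_copies(values, n_copies=5, seed=0):
    """Return n_copies completed lists; None marks a missing entry."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        # Each copy gets its own random draws, so the copies differ
        # only in the positions that were missing.
        copies.append([v if v is not None else rng.gauss(mean, sd)
                       for v in values])
    return copies

# Hypothetical SAT scores with two missing entries (not from the report).
sat = [1100, None, 1250, 980, None, 1170]
completed = impute_copies(sat)

# Pool a statistic across the copies by averaging the per-copy estimates.
pooled_mean = statistics.mean(statistics.mean(c) for c in completed)
```

Analyzing every completed copy and then pooling the estimates is what lets multiple imputation reflect the uncertainty about the missing values, which a single filled-in value cannot do.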
4. Data Filter Node
Extreme values are problematic because they can have undue influence on the model. We handled
them by excluding observations that contain outliers or other extreme values. Filtering also
reduces skewness and brings the variables closer to a normal distribution. The filtering methods
used were Standard Deviations from the Mean for the interval variables and Rare Values (Percentage)
for the class variables.
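The two filtering rules named above can be sketched in Python as follows. This is an illustration of the rules, not the Filter node's implementation. The cutoffs (3 standard deviations, 10 percent), the function names, and the sample rows are assumptions, since the report does not state the thresholds that were used.

```python
import statistics
from collections import Counter

def filter_interval(rows, column, max_sd=3.0):
    """Keep rows whose value lies within max_sd standard deviations of the mean."""
    values = [r[column] for r in rows]
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [r for r in rows if abs(r[column] - mean) <= max_sd * sd]

def filter_rare_levels(rows, column, min_pct=1.0):
    """Keep rows whose class level accounts for at least min_pct percent of rows."""
    counts = Counter(r[column] for r in rows)
    n = len(rows)
    return [r for r in rows if 100.0 * counts[r[column]] / n >= min_pct]

# Hypothetical data: 19 ordinary distances, one rare territory, one extreme distance.
rows = [{"DISTANCE": d, "TERRITORY": "A"} for d in range(10, 200, 10)]
rows[0]["TERRITORY"] = "Z"                          # level that occurs only once
rows.append({"DISTANCE": 5000, "TERRITORY": "A"})   # extreme value

kept = filter_interval(rows, "DISTANCE", max_sd=3.0)
kept = filter_rare_levels(kept, "TERRITORY", min_pct=10.0)
```

Both rules drop whole observations, so the order in which they are applied can change which rows survive.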
5. Data Partitioning
Before building our models, we split the data into training (70%) and validation (30%) sets. We chose a 70–30 split
because it leaves enough data to train the models while holding out enough for an honest assessment. The split also
preserved the proportion of the target found in the original dataset. A summary of the split is provided in Appendix A.
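A split that keeps the target proportion the same in both partitions is usually called a stratified split. The sketch below shows one way to do it in Python; it is not the Data Partition node's code, and the function name, seed, and synthetic rows are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(rows, target, train_frac=0.7, seed=42):
    """Split rows into train/validation, sampling each target level separately."""
    by_level = defaultdict(list)
    for r in rows:
        by_level[r[target]].append(r)
    rng = random.Random(seed)
    train, valid = [], []
    for level_rows in by_level.values():
        shuffled = level_rows[:]
        rng.shuffle(shuffled)
        cut = round(train_frac * len(shuffled))
        train.extend(shuffled[:cut])
        valid.extend(shuffled[cut:])
    return train, valid

# Synthetic balanced data: half the rows have ENROLL = 1.
rows = [{"id": i, "ENROLL": i % 2} for i in range(1000)]
train, valid = stratified_split(rows, "ENROLL")

train_rate = sum(r["ENROLL"] for r in train) / len(train)
valid_rate = sum(r["ENROLL"] for r in valid) / len(valid)
```

Because each target level is split 70–30 on its own, the enrollment rate in both partitions matches the rate in the full data.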
6. Data Transformation
Data transformation corrects for skewed distributions in the numerical input variables and for large numbers of
classes in the categorical variables. The skewness and kurtosis values obtained after filtering show that the
independent variables are approximately normally distributed. We nevertheless applied the “Maximum Normal”
transformation, one of the best-power transformation methods related to the Box-Cox family, to test whether it
would reduce skewness and kurtosis further.
Although the skewness values dropped, the decrease was not large enough to justify the transformation. For
instance, HSCRAT has a skewness of 2.64, and after the log transformation suggested by Maximum Normal its skewness
is still 1.9. Moreover, transformations have a cost: transformed variables (log, square root) are harder to
interpret, especially in a business setting. Therefore, we chose not to perform any transformation.
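The before-and-after skewness comparison described above can be reproduced in outline with a few lines of Python. The data here is synthetic (a right-skewed lognormal sample), not the HSCRAT values, and the skewness function uses the simple population formula rather than the bias-adjusted one that SAS reports, so the numbers will not match the report.

```python
import math
import random
import statistics

def skewness(values):
    """Population skewness: third central moment divided by sd cubed."""
    n = len(values)
    mean = statistics.mean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Synthetic right-skewed variable standing in for a skewed input.
rng = random.Random(1)
raw = [rng.lognormvariate(0.0, 0.75) for _ in range(5000)]
logged = [math.log(v) for v in raw]

skew_before = skewness(raw)
skew_after = skewness(logged)
```

In this synthetic case the log transformation removes almost all of the skewness; in the report the improvement for HSCRAT was much smaller, which is why the transformation was not adopted.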
Modeling
We used the following modeling techniques: Decision Tree, Forward and Stepwise Regression, Neural Network, and
Auto Neural.
1. Decision Trees
Decision tree methodology is a commonly used data mining method for establishing classification systems based
on multiple covariates or for developing prediction algorithms for a target variable. A split search algorithm
facilitates input selection, and model complexity is addressed by pruning. The settings used for the Decision Tree
node were Maximum Branch = 2, Maximum Depth = 6, and Minimum Leaf Size = 5. The subtree was selected with the
Assessment method, using misclassification as the assessment measure.
The variable importance report is shown below.
Variable Name       Importance
SELF_INIT_CNTCTS    1.0000
HSCRAT              0.3798
STUEMAIL            0.2767
INIT_SPAN           0.1404
MAILQ               0.0816
INTEREST            0.0698
INT1RAT             0.0638
Table 1 Variable Worth in Clusters
Model Assessment: Validation Misclassification Rate = 0.0512 and Training Misclassification Rate = 0.058.
SELF_INIT_CNTCTS was found to be the most important variable in determining the enrollment decision of a
prospective student. The minimum misclassification rate was found at Number of Leaves = 12.
2. Regression
Since our dependent variable ENROLL is a binary categorical variable, we used logistic regression.
2.1 Forward Regression
Forward regression creates a sequence of models of increasing complexity. At each step, each variable that is not
already in the model is tested for inclusion. The most significant of these variables is added to the model, as
long as its p-value is below SLENTRY = 0.05. The variables selected are SELF_INIT_CNTCTS, STUEMAIL,
HSCRAT, INIT_SPAN, DISTANCE, SATSCORE, MAILQ, and INT2RAT. We also studied interaction effects through
our regression model. The significant interactions are:
CAMPUS_VISIT * PREMIERE
REFERRAL_CNTCTS * INTEREST
INTEREST * PREMIERE
INSTATE * MAILQ
TRAVEL_INIT_CNTCTS * INTEREST
TERRITORY * STUEMAIL
Model Assessment: Validation Misclassification Rate = 0.0786 and Training Misclassification Rate = 0.073.
Sensitivity    Specificity
91%            93%
Table 2 Sensitivity and Specificity for Forward Regression Model
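Sensitivity, specificity, and the misclassification rate all come from the same confusion matrix. The short Python sketch below shows how they relate; the counts are hypothetical and are not taken from the report.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute rates from confusion-matrix counts.

    tp/fn: enrolled students predicted as enrolled / not enrolled.
    tn/fp: non-enrolled students predicted as not enrolled / enrolled.
    """
    total = tp + fn + tn + fp
    return {
        "sensitivity": tp / (tp + fn),           # share of enrollees caught
        "specificity": tn / (tn + fp),           # share of non-enrollees caught
        "misclassification": (fn + fp) / total,  # share of all cases predicted wrongly
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fn=10, tn=95, fp=5)
```

With a balanced sample, the misclassification rate is one minus the average of sensitivity and specificity, so the three figures reported for each model can be checked against one another.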
2.2 Stepwise Regression
Stepwise regression combines elements of the forward and backward selection procedures: variables are added one at
a time, and a variable already in the model can be removed if it is no longer significant. The only variable
selected in the final model is SELF_INIT_CNTCTS.
Logit(Enroll = 1) = -2.956 + 1.030 (SELF_INIT_CNTCTS)
With every unit increase in SELF_INIT_CNTCTS, the odds of enrollment are multiplied by 2.802 (the odds ratio, which
is the exponential of the coefficient).
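The relationship between the fitted logit and the odds ratio can be checked with a short Python sketch. The intercept and slope are the rounded values from the equation above; the function names are hypothetical. Because the coefficient is rounded, its exponential comes out at about 2.80, consistent with the reported 2.802.

```python
import math

INTERCEPT = -2.956   # from the fitted stepwise model
SLOPE = 1.030        # coefficient of SELF_INIT_CNTCTS (rounded)

def enroll_probability(self_init_cntcts):
    """Predicted probability of enrollment from the one-variable logit."""
    logit = INTERCEPT + SLOPE * self_init_cntcts
    return 1.0 / (1.0 + math.exp(-logit))

def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

# The odds ratio for a one-unit increase is exp(coefficient).
odds_ratio = math.exp(SLOPE)

# The same ratio appears between any two consecutive contact counts.
ratio_2_to_3 = odds(enroll_probability(3)) / odds(enroll_probability(2))
```

The odds ratio is constant across contact counts, but the change in probability is not: one extra contact moves the probability most when the starting probability is near 50%.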
Model Assessment: Validation Misclassification Rate = 0.106 and Training Misclassification Rate = 0.113.
Sensitivity    Specificity
89.7%          88.9%
Table 3 Sensitivity and Specificity for Stepwise Regression
Conclusion: Of the two regression models, forward regression gives better results based on the validation
misclassification rate.
3. Neural Network
In a neural network, the prediction formula is similar to that of a regression, but with a flexible addition in the
form of hidden units. Neural networks do not easily address input selection. For this reason, we used the variables
selected by the forward regression as inputs to the Neural Network model.
3.1 Neural Network
With the default setting of 3 hidden units, the convergence criterion was not met, so we reduced the number of
hidden units to 2. With 2 hidden units, the convergence criterion was satisfied. This model gave a validation
misclassification rate of 0.071 and a training misclassification rate of 0.072.
Sensitivity    Specificity
94.77%         91.15%
Table 4 Sensitivity and Specificity for Neural Network
3.2 Auto Neural
The Auto Neural tool offers an automatic way to explore alternative network architectures by fitting neural
networks with increasing hidden unit counts. Its output summarizes the training process, showing for each step the
fit statistics from the iteration with the smallest validation misclassification. Refer to Appendix C for the results.
Model Comparison
The Model Comparison node lets us evaluate the models against one another on a range of fit statistics.
The summary of fit statistics (Table 5) shows that the Decision Tree is the best model, with the
lowest training and validation misclassification rates. Its validation misclassification rate is 5.12%. We
therefore used this model to score the SCORE_DATA set.
Scoring
Since our training data is balanced, the prediction estimates reflect the target proportion in the training sample
and not in the population. Score ranking plots based on them are therefore inaccurate and misleading. To fix this,
we adjusted for separate sampling by setting the prior probability of the primary outcome to 3.1%.
Table 6 Scored Data Summary for Target Variable
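The prior-probability adjustment described in the Scoring section can be written as a small formula. The Python sketch below shows the standard correction for separate sampling; it is not the software's own code. The population prior of 3.1% comes from the report, while the sample prior of 50% is an assumption based on the training data being balanced, and the function name is hypothetical.

```python
def adjust_for_priors(p_sample, sample_prior=0.5, population_prior=0.031):
    """Rescale a probability estimated on a resampled set to the population.

    p_sample: predicted probability of the primary outcome from the model.
    sample_prior: share of the primary outcome in the training sample (assumed 50%).
    population_prior: share of the primary outcome in the population (3.1%).
    """
    w1 = population_prior / sample_prior                   # weight for the primary outcome
    w0 = (1.0 - population_prior) / (1.0 - sample_prior)   # weight for the secondary outcome
    return p_sample * w1 / (p_sample * w1 + (1.0 - p_sample) * w0)

# A prediction of 0.5 on the balanced sample maps back to the population rate.
baseline = adjust_for_priors(0.5)
# A confident prediction on the balanced sample is less extreme in the population.
confident = adjust_for_priors(0.9)
```

The adjustment keeps the ranking of students unchanged, since it is increasing in p_sample, but it brings the predicted probabilities in line with the true enrollment rate.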
Segmentation
The administration is interested in increasing enrollment, diversity, and SAT scores. To support this, we performed
clustering, which divides the data set into mutually exclusive groups that differ in their diversity. Clustering
also lets us identify the group of prospective students with high SAT scores, so that the administration can focus
its marketing strategy on that target group. The clustering node chose 3 clusters; the relative size of each
cluster is shown in the pie chart below.
Figure 1 Segmentation Pie Chart
Model                  Validation Misclassification Rate
Decision Tree          0.0512
Auto Neural            0.0663
Neural Network         0.0714
Forward Regression     0.0786
Stepwise Regression    0.1067
Table 5 Model Comparisons
DISTANCE was found to be the most important factor in differentiating the three segments.
Cluster 1: SOLICITED_CNTCTS and INSTATE are the most important predictor variables. Avg. SAT score = 1143.3.
The male-to-female proportion is 59% to 41%.
Cluster 2: DISTANCE and INSTATE are the most important predictor variables. Avg. SAT score = 1094.90.
The male-to-female proportion is 60% to 40%.
Cluster 3: INSTATE and INT2RAT are the most important predictor variables. Avg. SAT score = 1127.00.
The male-to-female proportion is 66% to 34%.
Conclusion
To increase the SAT scores of prospective enrollees, the administration should target Cluster 1 students, as they
have the highest average score. This group is also diverse in terms of both ethnicity and sex. Even though
ethnicity C dominates the data, Cluster 1 includes students from almost every ethnic group in some proportion. In
terms of gender, Cluster 1 has the most balanced split of the three clusters, with 59% males and 41% females.
Figure 2 Modeling Diagram
Based on the best-fitting model, the Decision Tree, the most important independent variables
are SELF_INIT_CNTCTS and HSCRAT. This suggests that the administration should focus on students
who initiated contact themselves and who come from high schools with the highest enrollment over
the last 5 years. If the administration offered these students a special welcome kit or a similar
welcome offer, it could convert more of them into enrollments.
Appendix A: Data Exploration
Variable              Standard Deviation   Skewness      Kurtosis       Missing Values
Enroll                0.500485             1.10E-16      -2.0007756     0
TOTAL_CONTACTS        3.480081             1.0517156     0.8543522      0
SELF_INIT_CNTCTS      3.0988946            0.8850357     0.305132       0
TRAVEL_INIT_CNTCTS    0.6702278            1.6645745     3.673759       0
SOLICITED_CNTCTS      0.7613853            1.8541231     8.1394438      0
REFERRAL_CNTCTS       0.288625             5.9594486     50.9889296     0
CAMPUS_VISIT          0.3774713            2.3172482     4.4588036      0
SATSCORE              151.4914425          -0.2413784    0.462179       1887
SEX                   0.4860401            -0.4837916    -1.7666479     127
MAILQ                 1.6001673            -0.8884705    -0.9904749     0
TELECQ                0.8074666            0.9573248     0.7499086      3055
PREMIERE              0.4094565            1.4024778     -0.033069      0
INTEREST              0.4118758            2.3363909     5.0932373      0
STUEMAIL              0.4379744            -1.1022225    -0.7854101     0
INIT_SPAN             9.1778057            -2.4740469    84.078123      0
INT1RAT               0.0358866            9.3802148     207.82226      0
INT2RAT               0.039164             7.2199216     111.921867     0
HSCRAT                0.1457441            4.4813438     23.4212246     0
AVG_INCOME            23083.61             0.9640883     0.9048331      763
DISTANCE              370.781848           2.580719      10.6365752     671
Table 7 Data Exploration for Given Data Set
Table 8 Skewness and Kurtosis Results after Filtering
Table 9 Filtering Results for Class Variables
Table 10 Filtering Results for Interval Variables
Figure 3 Excluded Observations after Filtering
Figure 4 Partition Summaries
Appendix B: Clustering
Table 11 Ethnicity distributions for each Segment
Table 12 Segmentation Report
1. Segment 1:
Table 13 Variable Worth for Segment 1
Table 14 Worth Plot for Segment 1
2. Segment 2:
Table 15 Variable worth for Segment 2
Figure 5 Worth Plot for Segment 2
3. Segment 3:
Table 16 Variable Worth for Segment 3
Figure 6 Worth Plot for Segment 3
Table 17 Overall variable importances in Clustering
Appendix C: Data Modeling
1. Decision Tree
Figure 7 Subtree Assessment Plot
Table 18 Fit Statistics for Decision Tree
Figure 8 English Rules for Decision Tree
Figure 9 Decision Tree
2. Forward Regression
Table 19 Fit Statistics for Forward Regression
Figure 10 Model Iteration Plot for Forward Regression
3. Stepwise Regression
Table 20 Fit Statistics Report for Stepwise Regression
Figure 11 Model Iteration Plot for Stepwise Regression
Figure 12 Summary of Stepwise Selection
4. Neural Network:
Table 21 Fit Statistics Table for Neural Network
Figure 13 Optimization Summaries for Neural Network
5. Auto Neural
Table 22 Fit Statistics for Auto Neural Network
Appendix D: Model Comparison
Table 23 Model Comparison Summaries
Figure 14 Sensitivity and Specificity of Models
Appendix E: Scored Data
Table 24 Scored Data Summary for Target Variable