assignment_week 5_taskin shakib.docx

Taskin Shakib

This data describes the survival status of 1309 of the 1324 individual passengers on the Titanic.

Survived: Yes or No
Passenger Class: 1, 2, or 3 corresponding to 1st, 2nd, or 3rd class
Sex: Passenger sex
Age: Passenger age
Siblings and Spouses: The number of siblings and spouses aboard
Parents and Children: The number of parents and children aboard
Fare: The passenger fare
Port: Port of embarkment (C = Cherbourg; Q = Queenstown; S = Southampton)

a. How many splits are in your final tree (Hint: Go)? Please include visualization of your split history, and short explanation on how SAS JMP arrived at this number of splits.

There are 6 splits in my final tree, based on 30% validation and 70% training. Figure 1 shows the coefficient of determination for both training and validation data.

Figure 1: Splits in Final Tree

The Split History report (Figure 2) shows how the R Square value changes for training and validation data after each split. The vertical line is drawn at the number of splits used in the final model, which has 6 splits. However, after 6 splits, the validation R Square (the red line in Figure 2) starts to decrease. For the validation set, which was not used to build the model, additional splits are not improving our ability to predict the response.
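The idea of stopping where the validation R Square peaks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the R Square values are made-up placeholders (not the values plotted in Figure 2), `best_split_count` is a hypothetical helper rather than a JMP function, and JMP's own stopping rule may differ in detail from simply taking the maximum.

```python
def best_split_count(validation_r2):
    """Return the number of splits with the highest validation R Square.

    validation_r2[i] is the validation R Square after i + 1 splits.
    """
    best_index = max(range(len(validation_r2)), key=lambda i: validation_r2[i])
    return best_index + 1


# Placeholder values for illustration only (NOT read from Figure 2):
# validation R Square rises through the 6th split, then declines.
example_validation_r2 = [0.20, 0.26, 0.29, 0.31, 0.32, 0.33, 0.32, 0.31]

chosen = best_split_count(example_validation_r2)
print("Splits with the best validation R Square:", chosen)
```

With these placeholder numbers the helper picks the split count just before the validation R Square begins to fall, which is the same pattern the vertical line marks in the Split History report.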

Figure 2: Split History

b. Which variables are the largest contributors? Expected responses are somewhat subjective, but assess the G^2 Entropy (or Information Gain) scores, and make an analyst decision on whether you view a sharp drop-off to finalize your answer.

The largest contributor in my model is Sex (as shown in Figure 3). JMP splits on the candidate with the highest LogWorth, so the model split first on Sex (LogWorth = 56.584). The G^2 value for Sex is also the highest (254.58). The second largest contributor is Passenger Class, with a G^2 value of 72.54 and a LogWorth of 16.79.
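The link between G^2 and LogWorth can be checked with a short Python sketch. LogWorth is -log10(p-value); the sketch assumes the p-value comes from a chi-square test on G^2 with 1 degree of freedom and no adjustment, which is an assumption on my part and may not match every detail of how JMP computes it. The function name `logworth_from_g2` is hypothetical.

```python
import math


def logworth_from_g2(g2):
    """Approximate LogWorth = -log10(p-value) for a two-way split.

    Assumptions: chi-square reference distribution with 1 degree of
    freedom and no p-value adjustment. For 1 degree of freedom,
    P(chi-square > g2) = erfc(sqrt(g2 / 2)).
    """
    p_value = math.erfc(math.sqrt(g2 / 2.0))
    return -math.log10(p_value)


# G^2 values reported in Figure 3
sex_logworth = logworth_from_g2(254.58)
class_logworth = logworth_from_g2(72.54)

print("Sex:", round(sex_logworth, 2))
print("Passenger Class:", round(class_logworth, 2))
```

Under these assumptions both results should land close to the LogWorth values reported in Figure 3, which supports ranking the candidates by either G^2 or LogWorth.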

Figure 3: Survived, First Candidate Split

c. What is the validation misclassification rate for this model? Is the model better at predicting survival or non-survival?

The misclassification rate for our validation data is 0.1937, or about 19.4%. The numbers behind the misclassification rate can be seen in the confusion matrix (Figure 4). We focus on the misclassification rate and confusion matrix for the validation data. Since these data were not used in building the model, this provides a better indication of how well the model classifies Survived.

Figure 4: Misclassification Rate and Confusion Matrix

There are four possible outcomes in our classification:

A passenger who survived is correctly classified as survived.
A passenger who survived is misclassified as did not survive.
A passenger who did not survive is misclassified as survived.
A passenger who did not survive is correctly classified as did not survive.
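The four outcomes above map directly onto the cells of a confusion matrix, and the rates follow by simple arithmetic. The Python sketch below shows the calculation; the counts are hypothetical placeholders (not the counts in Figure 4), and `classification_rates` is an illustrative helper name.

```python
def classification_rates(tp, fn, fp, tn):
    """Compute rates from the four confusion-matrix cells.

    tp: survived, classified as survived
    fn: survived, classified as did not survive
    fp: did not survive, classified as survived
    tn: did not survive, classified as did not survive
    """
    total = tp + fn + fp + tn
    return {
        "misclassification_rate": (fn + fp) / total,
        "survivor_accuracy": tp / (tp + fn),
        "non_survivor_accuracy": tn / (tn + fp),
    }


# Hypothetical counts for illustration only (NOT taken from Figure 4)
rates = classification_rates(tp=100, fn=50, fp=25, tn=225)
for name, value in rates.items():
    print(name, round(value, 4))
```

Comparing the survivor accuracy with the non-survivor accuracy is one way to judge which outcome the model predicts better.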

d. What is the area under the validation ROC curve for Survived? Interpret this value. Does the model do a better job of classifying survival than a random model?

The area under the curve, or AUC (labeled Area in Figure 5), is a measure of how well our model sorts the data. The area under the curve for Survived = Yes is 0.8591 (see Figure 5). This is well above the 0.5 expected from random sorting, so the model does a better job of classifying survival than a random model.
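One way to read an AUC value is as the probability that a randomly chosen survivor receives a higher predicted probability than a randomly chosen non-survivor. The Python sketch below computes AUC that way by comparing every pair; the scores are hypothetical placeholders, not model output from this assignment, and `auc_by_pairs` is an illustrative name.

```python
def auc_by_pairs(scores_yes, scores_no):
    """AUC as the share of (Yes, No) pairs the model ranks correctly.

    Ties between a Yes score and a No score count as half a win.
    """
    wins = 0.0
    for yes_score in scores_yes:
        for no_score in scores_no:
            if yes_score > no_score:
                wins += 1.0
            elif yes_score == no_score:
                wins += 0.5
    return wins / (len(scores_yes) * len(scores_no))


# Hypothetical predicted probabilities of Survived = Yes (illustration only)
scores_for_survivors = [0.9, 0.8, 0.6, 0.4]
scores_for_non_survivors = [0.7, 0.3, 0.2, 0.1]

auc = auc_by_pairs(scores_for_survivors, scores_for_non_survivors)
print("AUC:", auc)
```

A model that sorts at random would win about half of the pairs, which is why 0.5 is the baseline for comparison.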

Figure 5: ROC Curve for Survived

e. What is the lift for the model at portion = 0.3 and at portion = 0.5? Interpret these values.

The higher the lift at a given portion, the better our model is at correctly classifying the outcome within this portion.

For Survived = Yes, the lift at Portion = 0.3 is roughly 1.475 (see Figure 6). This means that in rows in the data table that correspond to the top 30% of the model’s predicted probabilities, the number of actual Yes outcomes is 1.475 times higher than we would expect if we had just chosen 30% of the rows from the data set at random.

For Survived = Yes, the lift at Portion = 0.4 is roughly 1.4 (see Figure 6). This means that in rows in the data table that correspond to the top 40% of the model’s predicted probabilities, the number of actual Yes outcomes is 1.4 times higher than we would expect if we had just chosen 40% of the rows from the data set at random.
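The lift calculation behind these statements can be sketched in Python: sort the rows by predicted probability, take the top portion, and divide the Yes rate in that portion by the overall Yes rate. The data below are hypothetical placeholders (not the Titanic validation rows), and `lift_at_portion` is an illustrative helper, not a JMP function.

```python
def lift_at_portion(predicted_prob_yes, actual_is_yes, portion):
    """Lift = (Yes rate in the top `portion` of rows) / (overall Yes rate).

    Rows are ranked from highest to lowest predicted probability of Yes.
    actual_is_yes holds 1 for Yes and 0 for No.
    """
    ranked = sorted(
        zip(predicted_prob_yes, actual_is_yes),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top_n = max(1, round(portion * len(ranked)))
    top_rate = sum(actual for _, actual in ranked[:top_n]) / top_n
    overall_rate = sum(actual_is_yes) / len(actual_is_yes)
    return top_rate / overall_rate


# Hypothetical rows for illustration only
probs = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
actual = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

lift_30 = lift_at_portion(probs, actual, 0.3)
lift_all = lift_at_portion(probs, actual, 1.0)
print("Lift at portion 0.3:", round(lift_30, 3))
print("Lift at portion 1.0:", round(lift_all, 3))
```

At portion 1.0 every row is included, so the lift is 1 by construction; values above 1 at smaller portions mean the model concentrates the Yes outcomes near the top of its ranking.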

Figure 6: Lift Curve for Survived

f. Summarize the three “purest” segments with respect to survival as if I were your manager at work, and I wanted a business language summary of the top three groups that the Decision Tree identified. (Hint: Leaf Report can be helpful here by reinterpreting the English rules that the Decision Tree generates into business language)

The three purest segments with respect to survival are defined by Sex, Passenger Class, and Age. In Figure 7, the highest predicted probability of survival (0.9238), shown in the 2nd row of the Leaf Report, comes from a leaf reached through three splits: first on Sex (male), then on Age, and finally on Siblings and Spouses (fewer than 4).

Here is the interpretation of this leaf, or decision rule: When the sex of a passenger is male, age is less than 11, and siblings and spouses less than 4, the predicted probability that survived = Yes is 0.9238 (and the probability that survived = No is 0.0762).

In business terms, this means that a male passenger younger than 11 who was travelling with fewer than 4 siblings or spouses had a 92.38% probability of surviving the sinking of the Titanic. A female passenger who belonged to 1st or 2nd class had a 92% chance of surviving. For male passengers, then, age and the number of family members aboard were the most important deciding factors. For female passengers, the most crucial deciding factor was the class of ticket she held.

Figure 7: Leaf Report
