data mining question set

DATA MINING FOR BUSINESS INTELLIGENCE
PROF. MAYTAL SAAR-TSECHANSKY

Assignment 1

I. Data Mining Concepts

Answer the questions below. Please be concise and phrase your answers carefully.

1. For each of the data mining tasks (i.e., classification, regression, clustering, link analysis, and sequence analysis), provide an example of a business problem that can be supported by that method. For each business problem (e.g., customer attrition), clearly formulate the business goal (e.g., prevent attrition of any customer who is likely to switch to a competitor), and explain how the data mining method can be used to achieve it (e.g., build a classification model to predict whether a customer will switch). Please do not use the examples discussed in class. (10 points)

2. Because descriptive data mining is not used to predict values of interest, it is beneficial for gaining insights into past events, but is not useful for supporting future decisions. Do you agree with this statement? Explain your answer. (15 points)

3. An analyst at a telecommunications company analyzed a subset of the firm's data and found that 20% of customers who made at least two calls to the company's customer service center within two months have switched to a competing provider. Later that day, the analyst repeated the analysis using another subset from the same database. This time, however, the analysis suggested that only 10% of customers switched after making at least two calls to customer service within a period of two months. Suggest possible reason(s) for the discrepancy between the patterns the analyst found in each case. (15 points)


II. Classification Trees

Note: This part of the assignment can be prepared electronically. However, if you prefer to use paper and pencil, you may submit a hard copy instead.

Classification trees are one of the most widely used data mining algorithms; they are simple yet effective. To get started, consider the problem of predicting whether or not the new president goes jogging on a particular day. You have observed the president's decisions in the past and constructed a training set of historical examples, including the weather on a given day, whether the president jogged the previous day, and the president's jogging decision on that day. You now want to generate a predictive model to predict the president's decisions in the future.

The values that the different attributes can take are provided below:

Attribute                      Possible Values
WEATHER                        Warm, Cold, Raining
JOGGED_YESTERDAY               Yes, No
Jog today (target variable)    Yes, No

Because the values of each attribute start with different letters, we'll use just the initial letter as shorthand, e.g., 'W' for Warm.

Our target/class variable (the variable whose value we want to predict) is whether or not the president will jog today. Here is our TRAINING data set, which we will use to build a predictive model of the president's decisions:


WEATHER    JOGGED_YESTERDAY    Target (Jog Today)
C          N                   Yes
W          Y                   No
R          Y                   No
C          Y                   No
R          N                   No
W          Y                   No
C          N                   No
W          N                   Yes
C          Y                   No
W          Y                   Yes
W          N                   Yes
C          N                   Yes
R          Y                   No
W          Y                   No

(a) Constructing the Initial Decision Tree (25 points)

Apply the classification tree building steps described in class (and in Chapter 6 of the text) to the TRAINING set, using information gain as the criterion for selecting splits to include in the tree. Show all your work, including which splits were considered at each step, all entropy calculations used to decide among alternative splits, and the final decision tree model you constructed.

As a reminder of the class discussion, recall that the process for selecting the best split at each node can be simplified into one rule: for each node in the tree, split the examples on the attribute that produces the largest information gain (formula provided in the class notes).

Your initial tree is just a single leaf node containing all the examples in the training data. For each leaf node, consider all available attributes as candidates for splits, and calculate the information gain obtained in each case. Then specify which of the possible splits/attributes provides the largest information gain, and incorporate this split into your tree model. If multiple attributes tie for the best one, choose the one whose name appears earliest in alphabetical order (e.g., JOGGED_YESTERDAY before WEATHER).

Recall that once a split is decided and subgroups (descendant nodes in the tree) are created in the tree model, these subgroups can themselves be split, unless one of the stopping rules applies. See the class notes for examples of how this is done.
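As an optional aid for checking hand calculations, the information gain rule above can be sketched in Python. This is a minimal sketch, not the formula sheet from the class notes: it assumes the standard base-2 Shannon entropy definition, the function names `entropy` and `information_gain` are made up for illustration, and the toy records are hypothetical rather than the assignment's training set.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Base-2 Shannon entropy of a list of class labels.

    Written as sum(p * log2(1/p)), which equals -sum(p * log2(p)).
    """
    total = len(labels)
    counts = Counter(labels)
    return sum((n / total) * log2(total / n) for n in counts.values())


def information_gain(rows, attribute, target):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        # Collect the class labels that fall under each attribute value.
        groups.setdefault(row[attribute], []).append(row[target])
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - children


# Hypothetical toy data (NOT the assignment's training set).
toy = [
    {"SIZE": "Big", "COLOR": "Red", "BUY": "Yes"},
    {"SIZE": "Big", "COLOR": "Blue", "BUY": "Yes"},
    {"SIZE": "Small", "COLOR": "Red", "BUY": "No"},
    {"SIZE": "Small", "COLOR": "Blue", "BUY": "No"},
]
print(information_gain(toy, "SIZE", "BUY"))   # 1.0: SIZE separates the classes perfectly
print(information_gain(toy, "COLOR", "BUY"))  # 0.0: COLOR carries no information here
```

In this toy case the split on SIZE would be chosen, because it yields the larger gain.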


This process continues for each new leaf node until either of two conditions is met:

1. All available attributes have already been included along the path through the tree, or

2. the training examples associated with this leaf node all have the same class value (i.e., the leaf is already pure).
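The recursive procedure with these two stopping conditions and the alphabetical tie-break could look roughly like the following. It is a sketch under assumptions, not the procedure from the class notes: the nested-dictionary tree representation, the helper names (`entropy`, `gain`, `build_tree`), and the rule that an impure leaf predicts its most common class are all choices made for this illustration, and the records are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Base-2 entropy, written as sum(p * log2(1/p))."""
    total = len(labels)
    return sum(n / total * log2(total / n) for n in Counter(labels).values())


def gain(rows, attr, target):
    """Information gain from splitting rows on attr."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in rows]) - weighted


def build_tree(rows, attributes, target):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [r[target] for r in rows]
    # Stop if the node is pure, or if every attribute was already used on this path.
    if len(set(labels)) == 1 or not attributes:
        # Assumption: an impure leaf predicts its most common class
        # (ties between classes are broken arbitrarily in this sketch).
        return Counter(labels).most_common(1)[0][0]
    # max() keeps the first maximum, so sorting the names first makes ties
    # resolve alphabetically; rounding guards against floating-point noise.
    best = max(sorted(attributes), key=lambda a: round(gain(rows, a, target), 10))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}


# Hypothetical toy data (NOT the assignment's training set).
toy = [
    {"SIZE": "Big", "COLOR": "Red", "BUY": "Yes"},
    {"SIZE": "Big", "COLOR": "Blue", "BUY": "No"},
    {"SIZE": "Small", "COLOR": "Red", "BUY": "No"},
    {"SIZE": "Small", "COLOR": "Blue", "BUY": "No"},
]
print(build_tree(toy, ["SIZE", "COLOR"], "BUY"))
# {'COLOR': {'Blue': 'No', 'Red': {'SIZE': {'Big': 'Yes', 'Small': 'No'}}}}
```

Here SIZE and COLOR tie at the root, so COLOR is chosen because its name comes first alphabetically; the Blue branch is already pure, and the Red branch is split once more on SIZE.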

You may use Excel to calculate the information gain from partitioning the training examples on a given attribute. Use the function =log(number, base), where base is 2 for the entropy calculation.

*If you prepare part (a) using paper and pencil, please make sure your answers can be read by the TA (in other words, type your answers or print neatly).

Please don't spend too much time on any given part. If you get stuck or have questions, please ask to meet with us (TA and instructor) and we will help you.

(b) Using the tree model for prediction, and estimating the model's predictive accuracy (15 points)

Here is a Test Set of examples for which you would like to generate predictions:

WEATHER    JOGGED_YESTERDAY    Target (Jog Today)
W          Y                   ?
R          N                   ?
C          N                   ?
C          Y                   ?
W          N                   ?
R          Y                   ?

Use the decision tree produced in part (a) to predict the class of (classify) each example in the TEST set. The table below contains the same examples as the Test Set, but also includes the correct classification of each example. What proportion of the cases in the Test Set was predicted accurately by your model?


WEATHER    JOGGED_YESTERDAY    Target (Jog Today)
W          Y                   No
R          N                   Yes
C          N                   Yes
C          Y                   No
W          N                   Yes
R          Y                   Yes
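Classifying test cases with a tree and scoring the predictions against the correct classes can be sketched as follows. The tree here is a hand-written, hypothetical one stored as nested dictionaries ({attribute: {value: subtree or class label}}), and the test cases are invented; neither is derived from the assignment data, and the names `predict` and `accuracy` are illustrative only.

```python
def predict(tree, example):
    """Follow the branch matching each attribute value until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        # Raises KeyError if the example has a value never seen in the tree.
        tree = branches[example[attribute]]
    return tree


def accuracy(tree, examples, target):
    """Proportion of examples whose predicted class equals the true class."""
    correct = sum(predict(tree, ex) == ex[target] for ex in examples)
    return correct / len(examples)


# Hypothetical tree and test cases (NOT the assignment's data).
toy_tree = {"COLOR": {"Blue": "No", "Red": {"SIZE": {"Big": "Yes", "Small": "No"}}}}
toy_test = [
    {"SIZE": "Big", "COLOR": "Red", "BUY": "Yes"},
    {"SIZE": "Small", "COLOR": "Red", "BUY": "Yes"},
    {"SIZE": "Big", "COLOR": "Blue", "BUY": "No"},
    {"SIZE": "Small", "COLOR": "Blue", "BUY": "No"},
]
print([predict(toy_tree, ex) for ex in toy_test])  # ['Yes', 'No', 'No', 'No']
print(accuracy(toy_tree, toy_test, "BUY"))         # 0.75 (3 of 4 correct)
```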

(d) Now assume that you did not record the weather on each day or whether the president jogged the previous day, and thus cannot induce a classification tree model. All you recorded was the last column of your training data: whether the president jogged that day.

* If you were to use only this information, what would be your best prediction for each of the cases in the Test Set? Explain clearly how you arrived at this prediction.
* Given these predictions, in what proportion of the cases in the Test Set are your predictions correct?
* Compare this result with the prediction accuracy of the classification tree model calculated in (b). Is the classification tree model better at predicting the president's decisions? (15 points)

(e) Generating Rules (5 points)

Extract two rules from the decision tree generated in part (a).
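For part (d), one common way to predict from class labels alone is to always guess the most frequent class in the training data. The sketch below illustrates that idea on hypothetical labels; it is offered as one reasonable reading of the question rather than the expected answer, and the function names are made up for illustration.

```python
from collections import Counter


def majority_class(training_labels):
    """Most frequent class label in the training data."""
    return Counter(training_labels).most_common(1)[0][0]


def baseline_accuracy(training_labels, test_labels):
    """Accuracy obtained by predicting the training majority class for every test case."""
    guess = majority_class(training_labels)
    return sum(label == guess for label in test_labels) / len(test_labels)


# Hypothetical labels (NOT the assignment's data).
train = ["No", "No", "No", "Yes"]
test = ["No", "Yes", "Yes", "No", "No"]
print(majority_class(train))           # No
print(baseline_accuracy(train, test))  # 0.6 (3 of 5 correct)
```

Comparing this figure with a model's accuracy on the same test cases shows how much the attributes add beyond the class distribution alone.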
