Data Mining Question Set
DATA MINING FOR BUSINESS INTELLIGENCE PROF. MAYTAL SAAR-TSECHANSKY
Assignment 1
I. Data Mining Concepts
Answer the questions below. Please be concise and phrase your answers carefully.
1. For each of the data mining tasks (i.e., classification, regression, clustering, link analysis and
sequence analysis) provide an example of a business problem that can be supported by these
methods. For each business problem (e.g., customer attrition) clearly formulate the business
goal (e.g., prevent the attrition of any customer who is likely to switch to a competitor), and
explain how the data mining method can be used to achieve it (e.g., build a classification
model to predict whether a customer will switch). Please do not use the examples discussed
in class. (10 points)
2. Because descriptive data mining is not used to predict values of interest, it is beneficial to
gain insights on past events, but is not useful to support future decisions. Do you agree with
this statement? Explain your answer. (15 points)
3. An analyst in a telecommunication company analyzed a subset of the firm's data and found
that 20% of customers who made at least two calls to the company's customer service center
within two months had switched to a competing provider. Later that day, the analyst repeated
the analysis, using another subset from the same database. However, this time the analysis
suggested that only 10% of customers switched once they had made at least two calls to
customer service within a period of two months. Suggest possible reason(s) for the
discrepancy between the patterns the analyst found in each case. (15 points)
II. Classification Trees
Note: This part of the assignment can be prepared electronically. However, if you prefer to use paper and pencil, you may submit a hard copy instead.
Classification trees are among the most widely used data mining algorithms; they
are simple yet effective. To get started, consider the problem of predicting
whether or not the new president goes jogging on a particular day. You have
observed the president's decisions in the past and constructed a training set of
historical examples including the weather on a given day, whether the president
jogged the previous day, and the president's jogging decision on that day.
You now want to generate a predictive model to predict the president's decisions
in the future.
The values that the different attributes can take are provided below:
Attribute                      Possible Values
WEATHER                        Warm, Cold, Raining
JOGGED_YESTERDAY               Yes, No
Jog today (target variable)    Yes, No
Because each of an attribute's values starts with a different letter, for shorthand we'll just use that initial letter, e.g., 'W' for Warm. Our target/class variable (the variable whose value we want to predict) is whether or not the president will jog today. Here is our TRAINING data set, which we will use to build a predictive model of the president's decisions:
WEATHER   JOGGED_YESTERDAY   Target (Jog Today)
C         N                  Yes
W         Y                  No
R         Y                  No
C         Y                  No
R         N                  No
W         Y                  No
C         N                  No
W         N                  Yes
C         Y                  No
W         Y                  Yes
W         N                  Yes
C         N                  Yes
R         Y                  No
W         Y                  No
(a) Constructing the Initial Decision Tree (25 points)
Apply the classification tree building steps described in class (and in Chapter 6 of the text) to the TRAINING set, using information gain as the criterion for selecting splits to include in the tree. Show all your work, including which splits were considered at each step, all entropy calculations used to decide among alternative splits, and the final decision tree model you constructed.
As a reminder of the class discussion, recall that the process for selecting the best split at each node can be simplified into one simple rule: for each node in the tree, split the examples based on the attribute that produces the largest information gain (formula provided in class notes).
Your initial tree is just a single leaf node, containing all the examples in the training data. For each leaf node, consider all available attributes as candidates for splits, and calculate the information gain obtained in each case. Then specify which of the possible splits/attributes provides the largest information gain and incorporate this split in your tree model. If multiple attributes tie for the best one, choose the one whose name appears earliest in alphabetical order (e.g., JOGGED_YESTERDAY before WEATHER).
Recall that once a split is decided and subgroups (descendant nodes in the tree) are created in the tree model, these subgroups themselves can also be split, unless one of the stopping rules applies. See the class notes for examples of how this is done.
This process continues for each new leaf node until either of two conditions is met:
1. All available attributes have already been included along the path through the tree, or
2. The training examples associated with this leaf node all have the same class value (i.e., the leaf is already pure).
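The split rule and the two stopping conditions above can be sketched as a short recursion in Python. This is only an illustration under stated assumptions, not the procedure from the class notes: the function names (entropy, information_gain, build_tree), the nested-dictionary tree representation, the majority-label rule for impure leaves, and the toy SIZE/COLOR/BUY data are all hypothetical, and the toy data is deliberately unrelated to the jogging data set.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    parent = entropy([r[target] for r in rows])
    children = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        children += len(subset) / len(rows) * entropy(subset)
    return parent - children

def build_tree(rows, attributes, target):
    """Return a nested dict for an internal node, or a class label for a leaf."""
    labels = [r[target] for r in rows]
    # Stop if the node is pure, or if no attribute is left along this path.
    if len(set(labels)) == 1 or not attributes:
        counts = Counter(labels)
        # Assumption: an impure leaf predicts its majority label
        # (label ties broken alphabetically).
        return min(counts, key=lambda c: (-counts[c], c))
    # Largest gain wins; ties go to the alphabetically earliest attribute name.
    best = min(attributes,
               key=lambda a: (-information_gain(rows, a, target), a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}

# Hypothetical toy data, unrelated to the assignment's training set.
toy_rows = [
    {"SIZE": "S", "COLOR": "R", "BUY": "Yes"},
    {"SIZE": "S", "COLOR": "B", "BUY": "Yes"},
    {"SIZE": "L", "COLOR": "R", "BUY": "No"},
    {"SIZE": "L", "COLOR": "B", "BUY": "Yes"},
]
toy_tree = build_tree(toy_rows, ["SIZE", "COLOR"], "BUY")
print(toy_tree)
```

In this toy data SIZE and COLOR happen to give the same information gain at the root, so the alphabetical tie-breaking rule selects COLOR. Comparing floating-point gains for exact equality works here but can be fragile in general; rounding the gains before comparing is one possible safeguard.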
You may use Excel to calculate the information gain from partitioning the training examples on a given attribute. Use the function =LOG(number, base), where base is 2 for entropy calculations.
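For readers who prefer a script to a spreadsheet, the same calculation can be sketched in Python, where math.log(number, base) takes the same two arguments as Excel's LOG. The helper names entropy_from_counts and gain_from_counts and the example counts are hypothetical; the sketch assumes the usual convention that an empty class contributes zero to the entropy.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits from class counts, e.g. [yes_count, no_count]."""
    total = sum(counts)
    result = 0.0
    for n in counts:
        if n > 0:  # convention: a class with zero examples contributes 0
            p = n / total
            result -= p * math.log(p, 2)  # same arguments as Excel's LOG(number, base)
    return result

def gain_from_counts(parent, children):
    """Information gain: parent entropy minus size-weighted child entropies."""
    total = sum(parent)
    weighted = sum(sum(c) / total * entropy_from_counts(c) for c in children)
    return entropy_from_counts(parent) - weighted

# Hypothetical counts: a node with 3 "Yes" and 1 "No" example, split into
# one child with [2 Yes, 0 No] and another with [1 Yes, 1 No].
print(entropy_from_counts([3, 1]))
print(gain_from_counts([3, 1], [[2, 0], [1, 1]]))
```

Working from class counts rather than raw rows mirrors how the calculation is typically laid out in a worksheet: one row per candidate split, one column per child node.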
*If you prepare question (a) using paper and pencil, please make sure the TA can read your answers (in other words, type your answers or print neatly).
Please don't spend too much time on any given part. If you get stuck and have questions, please ask to meet with us (TA and instructor) and we will help you.
(b) Using the tree model for prediction, and estimating the model's predictive accuracy (15 points)
Here is a Test Set of examples for which you would like to generate predictions:
WEATHER   JOGGED_YESTERDAY   Target (Jog Today)
W         Y                  ?
R         N                  ?
C         N                  ?
C         Y                  ?
W         N                  ?
R         Y                  ?
Use the decision tree produced in part (a) to predict the class of (i.e., classify) each example in the TEST set. The table below contains the same examples as in the Test Set, but also includes the correct classification of each example. What proportion of the cases in the Test Set was predicted accurately by your model?
WEATHER   JOGGED_YESTERDAY   Target (Jog Today)
W         Y                  No
R         N                  Yes
C         N                  Yes
C         Y                  No
W         N                  Yes
R         Y                  Yes
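Classifying test examples with a tree and scoring the predictions can be sketched as follows. The nested-dictionary tree format, the function names classify and accuracy, and the toy SIZE/COLOR/BUY tree and test rows are assumptions made for illustration; they are not the tree or the answer for part (a) or (b).

```python
def classify(tree, example):
    """Follow branches from the root until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))  # each internal node tests exactly one attribute
        # Raises KeyError if the example has a value the tree never saw.
        tree = tree[attribute][example[attribute]]
    return tree

def accuracy(tree, examples, target):
    """Proportion of examples whose predicted class equals the recorded class."""
    correct = sum(classify(tree, ex) == ex[target] for ex in examples)
    return correct / len(examples)

# Hypothetical tree and test rows, unrelated to the jogging data.
toy_tree = {"COLOR": {"B": "Yes", "R": {"SIZE": {"L": "No", "S": "Yes"}}}}
toy_test = [
    {"SIZE": "S", "COLOR": "R", "BUY": "Yes"},
    {"SIZE": "L", "COLOR": "R", "BUY": "Yes"},
    {"SIZE": "L", "COLOR": "B", "BUY": "Yes"},
    {"SIZE": "S", "COLOR": "B", "BUY": "No"},
]
print([classify(toy_tree, ex) for ex in toy_test])
print(accuracy(toy_tree, toy_test, "BUY"))
```

Because the test rows were not used to build the tree, the resulting proportion is an estimate of how the model would perform on new cases rather than a measure of how well it fits the training data.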
(d) Now assume that you did not record the weather on each day or whether the president jogged the previous day, and thus cannot induce a classification tree model. All you recorded was the last column in your training data: whether the president jogged that day.
* If you were to use only this information, what would be your best prediction for each of the cases in the Test Set? Explain clearly how you arrived at this prediction.
* Given these predictions, in what proportion of the cases in the Test Set are your predictions correct?
* Compare this result with the prediction accuracy of the classification tree model calculated in (b). Is the classification model better at predicting the president's decisions? (15 points)
(e) Generating Rules (5 points)
Extract two rules from the decision tree generated in step (a).
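The ideas behind parts (d) and (e) can each be sketched in a few lines of Python. One common attribute-free baseline is to always predict the most frequent class in the training data, and one common way to read rules off a tree is to turn each root-to-leaf path into an IF/THEN statement. Both are offered here as assumptions rather than as the expected answers, and the function names, the nested-dictionary tree format, and the toy labels and tree are hypothetical.

```python
from collections import Counter

def majority_class(labels):
    """Most frequent label; ties broken alphabetically (an assumption)."""
    counts = Counter(labels)
    return min(counts, key=lambda c: (-counts[c], c))

def baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the training set's majority class."""
    guess = majority_class(train_labels)
    return sum(label == guess for label in test_labels) / len(test_labels)

def extract_rules(tree, conditions=()):
    """Turn every root-to-leaf path of a nested-dict tree into one rule string."""
    if not isinstance(tree, dict):
        premise = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {premise} THEN class = {tree}"]
    attribute = next(iter(tree))  # the attribute tested at this node
    rules = []
    for value, subtree in tree[attribute].items():
        rules += extract_rules(subtree, conditions + ((attribute, value),))
    return rules

# Hypothetical labels and tree, unrelated to the jogging data.
toy_train_labels = ["Yes", "Yes", "No", "Yes"]
toy_test_labels = ["Yes", "Yes", "Yes", "No"]
toy_tree = {"COLOR": {"B": "Yes", "R": {"SIZE": {"L": "No", "S": "Yes"}}}}

print(baseline_accuracy(toy_train_labels, toy_test_labels))
for rule in extract_rules(toy_tree):
    print(rule)
```

Comparing a model against such a baseline shows how much of its accuracy comes from the attributes rather than from the class distribution alone.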