
Tutorial for RapidMiner – Advanced Tree and CRISP-DM Model with Market Segmentation

Business Intelligence NAME: _______________

Professor Chen Due Date: _____________

Tutorial for Rapid Miner

(Advanced Decision Tree and CRISP-DM Model with an example of Market Segmentation*)

Tutorial Summary Objective: Richard would like to figure out which customers he could expect to buy the new eReader and on what time

schedule, based on the company’s last release of a high-profile digital reader.

How:

The decision tree has enabled him to predict that and to determine how reliable the predictions are. He has also

been able to determine which attributes are the most predictive of eReader adoption, and to find greater

granularity (by switching the Decision Tree’s criterion from “gain_ratio” to “gini_index”).

PART I. Decision Trees with Two Datasets

1. Import two CSV data sets (the training and scoring data sets).

a. Be sure to change the column separator from semicolon (“;”) to comma (“,”).

2. The training data set has one additional attribute (i.e., eReader_Adoption) in comparison with the scoring data set.

3. Add “Set Role” operator

a. Add a “Set Role” operator to both datasets and set the role of “User_ID” to “id”

b. Add another “Set Role” operator to the Training dataset and change the role of “eReader_Adoption” to

“label” (target attribute)

4. Add a “Decision Tree” operator to the Training dataset. Run it and view the result in Graphic View (with

three good predictors – nodes with gray oval shapes; and four label values – leaves with multicolored end

points).

PART II. MARKET SEGMENTATION by Combining Two DataSets

1. Add a new “Apply Model” operator to combine the two datasets.

2. Run again.

3. The solution of Graphic View remains the same; however, other results are obtained – tree has been applied

to the scoring data. It means that confidence attributes have been added.

4. From the Meta Data View: as was the case with logistic regression, four (4) confidence attributes have been created by

RapidMiner, along with a prediction attribute (i.e., eReader_Adoption) (Fig. 10-a on the Tutorial or Figure

10-9 in the North’s text)

5. Switch to Data View and examine Row. 14:

a. RapidMiner is very (but not 100%) convinced that person 77373 (Row 14, Fig. 10-10) is going to be a

member of the “early majority” (88.9%).

b. Despite some uncertainty, RapidMiner is completely sure that this person is not going to be an early

adopter (0%).

* Information is from Data Mining for the Masses by Matthew North (chapter 10)


CRISP-DM Methodology

• CRISP-DM stands for Cross-Industry Standard Process for Data Mining. The CRISP-DM methodology provides a

structured approach to planning a data mining project. It is a robust and well-proven methodology.

• We are evangelists of its powerful practicality, its flexibility and its usefulness when using analytics to

solve thorny business issues. It is the golden thread that runs through almost every client engagement.

The CRISP-DM model is shown below.

Six Phases of CRISP-DM Model

We will explore this data mining example (with market segmentation) using the CRISP-DM methodology.

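[Figure: CRISP-DM cycle with its six phases: [1] Business Understanding, [2] Data Understanding, [3] Data Preparation, [4] Modeling, [5] Evaluation, [6] Deployment]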


Phase 1. Business Scenario (Business Understanding)

• Richard works for a large online retailer. His company is launching a next-generation eReader soon, and

they want to maximize the effectiveness of their marketing. They have many customers, some of whom

purchased one of the company’s previous generation digital readers. Richard has noticed that certain

types of people were the most anxious to get the previous generation device, while other folks seemed

content to wait to buy the electronic gadget later. He’s wondering what makes some people motivated to

buy something as soon as it comes out, while others are less driven to have the product.

• Richard’s employer helps to drive the sales of its new eReader by offering specific products and

services for the eReader through its massive web site - for example, eReader owners can use the

company’s web site to buy digital magazines, newspapers, books, music, and so forth.

• The company also sells thousands of other types of media, such as traditional printed books and

electronics of every kind. Richard believes that by mining the customers’ data regarding general

consumer behaviors on the web site, he’ll be able to figure out which customers will buy the new

eReader early, which ones will buy next, and which ones will buy later on.

• He hopes that by predicting when a customer will be ready to buy the next-gen eReader, he’ll be able to

time his target marketing (i.e., market segmentation) to the people most ready to respond to

advertisements and promotions.

Organizational Understanding and Diffusion of Innovation Theory **

Diffusion of Innovation Theory

Definition: The process by which an innovation is communicated through certain channels over time among the

members of a social system.

Four elements:

• Innovation: an idea, practice, or object that is perceived as new by an individual or other unit of adoption.

• Communication channels: the means by which messages get from one individual to another.

• Time: a) innovation-decision process

b) relative time with which an innovation is adopted

c) innovation’s rate of adoption

• Social system: a set of interrelated units that are engaged in joint problem solving to accomplish a common goal.

** By Everett Rogers in his book “Diffusion of Innovations.”

The diffusion of innovations according to Rogers. With successive groups of consumers adopting the new technology

(shown in blue), its market share (yellow) will eventually reach the saturation level. In mathematics the S curve is known as

the logistic function.

[Figure: number of adopters by group and cumulative number of adopters over time, with the Critical Mass point marked. Annotations on the groups, in order: join when it is new; join when they perceive a benefit; join when there is a productivity gain; join when there is plenty of help and support; join when they have to. Group label: (Late Adopters).]
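
The S curve named in the caption can be tried out numerically. The snippet below is only a minimal sketch of a standard logistic function; the parameter names (saturation, growth_rate, midpoint) and the sample time points are assumptions for illustration and do not come from Rogers or from the tutorial data.

```python
import math

def logistic(t, saturation=1.0, growth_rate=1.0, midpoint=0.0):
    """Cumulative share of adopters at time t under a logistic S-curve."""
    return saturation / (1.0 + math.exp(-growth_rate * (t - midpoint)))

# Cumulative adoption rises slowly, accelerates around the midpoint,
# then levels off as it approaches the saturation level.
for t in range(-6, 7, 3):
    print(f"t={t:+d}  cumulative share={logistic(t):.3f}")
```

The per-period number of new adopters corresponds to the change in this curve between two time points, which is largest around the midpoint.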


Phase 2. Data Understanding

Two Data Sets are listed below:

1. Scoring Data Set

– Online_Retailer_DataSet_Scoring.csv

2. Training Data Set

– Online_Retailer_DataSet_Training.csv

Data file: Online_Retailer_DataSet_Scoring.csv (473 records)

Data file: Online_Retailer_DataSet_Training.csv (661 records)

Q: What is the difference (attributes) between these two data sets?
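
One way to check the answer outside RapidMiner is to compare the header rows of the two files. The sketch below uses short made-up extracts instead of the real files; only the column names User_ID, Age, Website_Activity and eReader_Adoption are taken from this tutorial, and the data values are hypothetical.

```python
import csv
import io

# Hypothetical one-row extracts standing in for the two CSV files.
training_csv = (
    "User_ID,Age,Website_Activity,eReader_Adoption\n"
    "1001,30,Seldom,Late Majority\n"
)
scoring_csv = (
    "User_ID,Age,Website_Activity\n"
    "2001,22,Regular\n"
)

def header(csv_text):
    """Return the list of column names from the first CSV row."""
    return next(csv.reader(io.StringIO(csv_text)))

# Columns that exist in the training data but not in the scoring data.
only_in_training = [c for c in header(training_csv) if c not in header(scoring_csv)]
print(only_in_training)
```

With the real files, the same header() idea would apply to the first line read from each file.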

Four possible values of eReader_Adoption:

[1] Innovator: purchase within the 1st week

[2] Early Adopter: purchase within 2 or 3 weeks

[3] Early Majority: purchase > 3 weeks but <= 2 months

[4] Late Majority: purchase after the first two months
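
The four definitions above can be written as a small rule. This is a sketch under stated assumptions: the function name and the day counts (1 week = 7 days, 3 weeks = 21 days, 2 months = 60 days) are illustrative choices, not part of the data set documentation.

```python
def adoption_category(days_to_purchase):
    """Map days between product launch and purchase to an adoption group.

    Assumed conversions: 1 week = 7 days, 3 weeks = 21 days,
    2 months = 60 days.
    """
    if days_to_purchase <= 7:
        return "Innovator"
    if days_to_purchase <= 21:
        return "Early Adopter"
    if days_to_purchase <= 60:
        return "Early Majority"
    return "Late Majority"

print(adoption_category(5), "|", adoption_category(45))
```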


Meta Data on Data Sets

Business Scenario – Summary and Goal

• Richard has a list of customers and their probable adoption timings for the next-gen eReader. These

customers are identifiable by the User_ID that was retained in the results perspective data but not used as a

predictor in the model.

• Goal:

– He wants to ________ these customers and begin a process of target marketing that is timely and

relevant to each individual.

• The criteria and suggestions for segmenting customers are as follows:

[1] Those who are most likely to purchase immediately (predicted innovators) can be contacted and

encouraged to go ahead and buy as soon as the new product comes out. They may even want the option

to pre-order the new device.

On the other hand, perhaps very little marketing is needed to the predicted innovators, since

they are predicted to be the most likely to buy the eReader in the first place.

[2] Those who are likely to purchase early (predicted early adopters, who are often opinion leaders)

should be …

[3] Those who are less likely (predicted early majority) might need some persuasion, perhaps a free

digital book or two with eReader purchase or a discount on digital music playable on the new eReader.

[4] The least likely (predicted late majority), can be marketed to passively, or perhaps not at all if

marketing budgets are tight and those dollars need to be spent incentivizing the most likely customers to

buy.
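
Once each customer has a predicted adoption group, building the four marketing segments is a simple grouping step. The sketch below assumes the predictions are available as (User_ID, predicted group) pairs; IDs 77373 and 56031 and their predicted groups are discussed in this tutorial, while the other two rows are made up.

```python
from collections import defaultdict

# (User_ID, predicted eReader_Adoption)
predictions = [
    (77373, "Early Majority"),
    (56031, "Early Adopter"),
    (10001, "Innovator"),      # hypothetical row
    (10002, "Late Majority"),  # hypothetical row
]

# Collect the customer IDs that fall into each segment.
segments = defaultdict(list)
for user_id, category in predictions:
    segments[category].append(user_id)

for category in ("Innovator", "Early Adopter", "Early Majority", "Late Majority"):
    print(category, segments[category])
```

Each list can then be handed to the campaign that matches the criteria [1] to [4] above.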


Phase 3. Data Preparation - Steps to Achieve the Goal

• Step A: Decision Tree

• Step B: Market Segmentation and HOW? (for details, see Phase 4)

Step A: Decision Trees

• Decision trees are excellent predictive models when the target attribute is categorical in nature (e.g.,

eReader_Adoption with four possible values), and when the data set is of mixed types.

• In some cases, decision trees are better than more statistics-based approaches at handling attributes that

have missing or inconsistent values, because decision trees will work around such data

and still generate usable results.

Decision Trees – with Training and Scoring Data Sets

• Decision trees are made of nodes and leaves (connected by labeled branch arrows), representing the best

predictor attributes in a data set.

• The nodes and leaves lead to confidence percentages in the training data set, and can then be applied to

similarly structured scoring data in order to generate predictions for the scoring observation.

• Decision trees tell us:

a) What is predicted,

b) How confident we can be in the prediction, and

c) How we arrived at the prediction, shown in a graphical view.
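
To see where a confidence percentage can come from, consider the training records that end up in one leaf. The sketch below assumes that a leaf's confidences are simply the shares of each label among those records, which is a common way for decision trees to report them; the tutorial does not spell out RapidMiner's exact calculation, and the leaf contents here are hypothetical.

```python
from collections import Counter

def leaf_confidences(leaf_labels):
    """Turn the training labels that reached one leaf into confidences."""
    counts = Counter(leaf_labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical leaf: 8 of 9 training customers were Early Majority.
leaf = ["Early Majority"] * 8 + ["Late Majority"]
confidences = leaf_confidences(leaf)
print({label: round(share, 3) for label, share in confidences.items()})
```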

1. Import the first data set (i.e., Online_Retailer_DataSet_Training.csv) into RapidMiner repository as shown in

Fig. 1-a thru Fig. 1-g.

a) Import Data, then Read CSV (since both are *.csv files)

b) A Read CSV operator is created (by double-clicking or by drag and drop).

c) On Step 2 (of 4), make sure to change the Column Separation setting from Semicolon “;” to Comma “,”,

since the file is in CSV format.

d) Change the name of the operator (Read CSV) to “Training” as shown in Fig 1-f and 1-g.
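
Why the separator setting matters can be shown with a few lines of Python. This is a sketch on a made-up one-row extract (the data values are hypothetical): parsing comma-separated text with a semicolon separator leaves every row as one single column.

```python
import csv
import io

# Hypothetical extract of a comma-separated file.
sample = "User_ID,Age,Website_Activity\n2001,22,Regular\n"

def parse(text, delimiter):
    """Parse CSV text into a list of rows using the given column separator."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

wrong = parse(sample, ";")  # each row collapses into a single column
right = parse(sample, ",")  # three separate columns

print(len(wrong[0]), len(right[0]))
```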

Fig 1-a


Fig 1-b

Fig 1-c

Fig 1-d


Fig 1-e

Fig 1-f

Fig 1-g


2. Repeat the step to import the second data set (i.e., Online_Retailer_DataSet_Scoring.csv) into RapidMiner

repository and the result is shown in Fig. 2

Fig 2

3. RUN the model to examine the data and familiarize yourself with the attributes.

Hint: make sure to connect the “out” port to the “res” port in the process area; otherwise, no result

is produced.

Fig 3

4. The Data Views for the two data sets are illustrated in Fig 4-a and 4-b.

Fig 4-a


Fig 4-b

5. SAVE the two data sets and process into the “Local Repository” as shown in Fig 5-a and Fig 5-b. Why?

a) Return to “Design” perspective

b) Click on “Training” operator

c) Select “File” then “Save Process As”

d) Click “data” under “Local Repository” on the Repository Browser box

e) Enter “Online Retailer Training Data Set”

f) Repeat the same step for the other data set, entering “Online Retailer Scoring Data Set”

Note that data and processes are both saved through the same “Save Process As” option; you then choose either

“data” or “process”.

Fig 5-a


Fig 5-b

6. Finally SAVE the process into the Local Repository (Fig 6)

a) Return to “Design” perspective

b) Click anywhere on the “Process” area

c) Select “File” then “Save Process As”

d) Select “process” under “Local Repository” on the Repository Browser box

e) Enter “Online Retailer eReader Adoption Project”

Fig 6

7. Our first goal (Step A) is to find a solution using “Decision Tree” as shown in Fig 7-a to Fig 7-c.

a) While there are no missing or apparently inconsistent values in the data set, there is still some data

preparation yet to do. First of all, the User_ID is an arbitrarily assigned value for each customer. User_ID

should not be included in the model as an independent variable.

b) Rather than removing the attribute (User_ID) we will try a new way of handling a non-predictive attribute.

This is accomplished using the Set Role operator. Using the search field in the Operators tab, find and add Set

Role operators to both your training and scoring streams. Be sure to re-connect the ports from Set Role

operators to “res”.

c) In the Parameters area on the right hand side of the screen, set the role of the User_ID attribute to ‘id’. This

will leave the attribute in the data set throughout the model, but it won’t consider the attribute as a predictor


for the label attribute. Be sure to do this for both the training and scoring data sets, since the User_ID attribute

is found in both of them.

Fig 7-a

Fig 7-b

Fig 7-c

7. cont.

d) One of the nice side-effects of setting an attribute’s role to ‘id’ rather than removing it using a Select

Attributes operator is that it makes each record easier to match back to individual people later, when viewing

predictions in results perspective.

e) Before adding a Decision Tree operator, we still need to do another data preparation step by adding another

Set Role operator. The Decision Tree operator, as with other predictive model operators we’ve used to this

point, expects the training stream to supply a ‘label’ attribute. For this example, we want to predict which

adopter group Richard’s next-gen eReader customers are likely to be in. So our label will be

eReader_Adoption and it should be “reset” in the Set Role operator (Fig 7-d)

f) Next, add a “Decision Tree” operator to your training stream, as shown in Figure 7-e.

g) Finally, Run the model and switch to the Tree (Decision Tree) tab in results perspective. You will see our

preliminary tree in Figure 8-a.
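
The effect of the two Set Role steps can be pictured as splitting each record into three parts. The following is only a conceptual sketch, not RapidMiner code: the function split_by_role and the sample record are hypothetical, while the two role assignments are the ones made in steps c) and e).

```python
# Roles as assigned in the tutorial: User_ID -> id, eReader_Adoption -> label.
ROLES = {"User_ID": "id", "eReader_Adoption": "label"}

def split_by_role(record, roles=ROLES):
    """Split one record into (id, label, predictors) according to roles."""
    record_id = None
    label = None
    predictors = {}
    for name, value in record.items():
        role = roles.get(name, "regular")
        if role == "id":
            record_id = value       # kept for matching, not for prediction
        elif role == "label":
            label = value           # the attribute the model learns to predict
        else:
            predictors[name] = value
    return record_id, label, predictors

# Hypothetical training record.
row = {"User_ID": 1001, "Age": 30, "Website_Activity": "Seldom",
       "eReader_Adoption": "Late Majority"}
print(split_by_role(row))
```

A scoring record goes through the same split; its label part simply stays empty because eReader_Adoption is not in that data set.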


Fig 7-d

Fig 7-e

8. Interpretation of the Decision Tree view

• In the Decision Tree view (Figure 8-a) we can see what are referred to as nodes and leaves. The nodes are

the gray oval shapes. They are attributes which serve as good predictors for our label attribute. The leaves

are the multicolored end points that show us the distribution of categories from the label attribute that follow

the branch of the tree to the point of that leaf. We can see in this tree that Website_Activity (on the top of the

decision tree) is our best predictor of whether or not a customer is going to adopt (buy) the company’s new

eReader. If the person’s activity is frequent or regular, we see that they are likely to be an Innovator or

Early Adopter, respectively.

• If however, they seldom use the web site, then whether or not they’ve bought digital books becomes the

next best predictor of their eReader adoption category. If they have not bought digital books through the

web site in the past, Age is another predictive attribute which forms a node, with younger folks adopting

sooner than older ones. This is seen on the branches for the two leaves coming from the Age node in Fig 8-a.

• Those who seldom use the company’s website, have never bought digital books on the site, and are older

than 25 ½ are most likely to land in the Late Majority category, while those with the same profile but are

under 25 ½ are bumped to the Early Majority prediction. In this example you can see how you read the

nodes, leaves and branch labels as you move down through the tree.
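
Reading the tree from top to bottom amounts to a chain of if-statements. The sketch below encodes only the branches described in the paragraphs above; the function name, the spellings "Frequent" and "Regular", and the use of a True/False flag for digital book purchases are assumptions. The leaf for seldom-active customers who did buy digital books is not stated in the text, so the sketch leaves it open.

```python
def predict_adoption(website_activity, bought_digital_books, age):
    """Follow the branches of the preliminary tree described in the text."""
    if website_activity == "Frequent":
        return "Innovator"
    if website_activity == "Regular":
        return "Early Adopter"
    # Remaining case: the customer seldom uses the web site.
    if bought_digital_books:
        # This leaf is not stated in the text; read it from Fig 8-a.
        return None
    return "Late Majority" if age > 25.5 else "Early Majority"

print(predict_adoption("Seldom", False, 30))
```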

Fig 8-a


8. Cont.

What information is still missing in Fig 8-b? This implies that we need to move to Phase 4.

Fig 8-b

Phase 4. Modeling

Step B of achieving the Goal of Market Segmentation

1. With our predictor attributes prepared, we are now ready to move on to Phase 4, Modeling. Save the process if

needed.

2. Return to design perspective. In the Operators tab search for and add an Apply Model operator, bringing the

training and scoring streams together. Ensure that both the lab (labelled data) and mod ports are connected

to res ports in order to generate our desired outputs (Figure 9). Furthermore, the exa (example set) port of the Set Role (2)

operator should be connected to the unl (unlabelled data) port of Apply Model.

Fig 9

3. Run the model. You will see familiar results—the tree remains the same, for now. Click on the ExampleSet

tab next to the Tree tab. Our tree has been applied to our scoring data. As was the case with logistic regression,

confidence attributes (shown in Meta Data View) have been created by RapidMiner, along with a prediction

attribute (Figure 10-a and 10-b)


Fig 10-a

Fig 10-b

Phase 5. Evaluation

1. Switch to Data View using the radio button. We see in Figure 10-b the prediction for each customer’s

adoption group, along with confidence percentages for each prediction. There are four confidence attributes,

corresponding to the four possible values in the label (eReader_Adoption). We interpret these the same way that

we did with the other models though - the percentages add to 100%, and the prediction is whichever category

yielded the highest confidence percentage. RapidMiner is very (but not 100%) convinced that person 77373

(Row 14, Figure 11) is going to be a member of the early majority (88.9%). Despite some uncertainty,

RapidMiner is completely sure that this person is not going to be an early adopter (0%).

• Row no. 14 can be interpreted as: person 77373 is going to be a member of the early majority (88.9%).

Despite some uncertainty, RapidMiner is completely sure that this person is not going to be an early adopter

(0%).
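
The two rules used here (the confidences add to 100%, and the prediction is the category with the highest confidence) can be checked in a few lines. In the sketch below, the 88.9% and 0% values for person 77373 come from the text; the other two values are hypothetical fillers chosen only so that the row sums to 100%.

```python
# Confidence percentages for person 77373 (Row 14).
row_14 = {
    "Early Majority": 88.9,
    "Early Adopter": 0.0,
    "Innovator": 0.0,       # hypothetical filler
    "Late Majority": 11.1,  # hypothetical filler
}

def predicted_category(confidences):
    """Return the category with the highest confidence percentage."""
    return max(confidences, key=confidences.get)

# The four confidences should add up to 100%.
assert abs(sum(row_14.values()) - 100.0) < 1e-9
print(predicted_category(row_14))
```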

Question: What is the business “implication”?

Answer: _______ ____________

• However, for other persons it is difficult to tell which segment they belong to. Why?

• How to resolve this issue? See further process after ‘Deployment’.


Fig 11

Phase 6. Deployment

• Richard’s original desire was to be able to figure out which customers he could expect to buy the new

eReader and on what time schedule, based on the company’s last release of a high-profile digital reader.

• The decision tree has enabled him to predict that and to determine how reliable the predictions are and the

likelihood of buying for each group. He’s also been able to determine which attributes are the most

predictive of eReader adoption.

• In order to produce better results for market segmentation we need to find greater detail, or greater

granularity in the model by using gini_index as the tree’s underlying algorithm.

Revisiting … We are now re-visiting the following phases:

• Phase 4 Modeling

• Phase 5 Evaluation

• Phase 6 Deployment

Phase 4. Modeling – Revisiting

1. Remember that CRISP-DM is cyclical in nature, and that in some modeling techniques, especially those with

less structured data, some back and forth trial-and-error can reveal more interesting patterns in data.

2. Switch back to design perspective, click on the Decision Tree operator, and in the Parameters area, change the

‘criterion’ parameter from ‘gain_ratio’ to ‘gini_index’, as shown in Figure 12.

3. Re-RUN the model.

Fig. 12
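
To get a feel for what the two criteria measure, the sketch below computes a Gini-based gain and a gain ratio for one candidate split. It uses the textbook definitions (Gini impurity, entropy, split information); RapidMiner's internal implementation may differ in details, and the toy node with eight customers is hypothetical. The functions assume a split into at least two non-empty children.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini_gain(parent, children):
    """Reduction in Gini impurity achieved by a split."""
    n = len(parent)
    return gini(parent) - sum(len(ch) / n * gini(ch) for ch in children)

def gain_ratio(parent, children):
    """Information gain divided by the split information."""
    n = len(parent)
    info_gain = entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)
    split_info = -sum(len(ch) / n * math.log2(len(ch) / n) for ch in children)
    return info_gain / split_info

# Hypothetical node: 4 Innovators and 4 Late Majority customers,
# split perfectly into two pure children.
parent = ["Innovator"] * 4 + ["Late Majority"] * 4
children = [["Innovator"] * 4, ["Late Majority"] * 4]
print(gini_gain(parent, children), gain_ratio(parent, children))
```

Because the two criteria score candidate splits differently, a tree grown with one can choose different attributes, and a different number of nodes, than a tree grown with the other.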


Phase 5. Evaluation – Revisiting

1. We see in this tree (Figure 13-a) that there is much more detail, more granularity in using the Gini algorithm

as our parameter for our decision tree. We could further modify the tree by going back to design view and

changing the minimum number of items to form a node (size for split) or the minimum size for a leaf. Even

accepting the defaults for those parameters though, we can see that the Gini algorithm alone is much more

sensitive than is the Gain Ratio algorithm in identifying nodes and leaves.

2. Take a minute to explore around this new tree model. We will find that it is extensive, and that we will need to use

both the Zoom and Mode tools to see it all. We should find that most of our other independent variables

(predictor attributes) are now being used, and the granularity with which Richard can identify each customer’s

likely adoption category is much greater.

3. How active the person is on Richard’s employer’s web site is still the single best predictor, but gender, and

multiple levels of age have now also come into play. We will also find that a single attribute is sometimes used

more than once in a single branch of the tree. Decision trees are a lot of fun to experiment with, and with a

sensitive algorithm like Gini generating them, they can be tremendously interesting as well.

Fig. 13-a

4. Switch to the ExampleSet tab in Data View. We see here (Figure 13-b) that changing our tree’s underlying

algorithm has, in some cases, also changed our confidence in the prediction.

5. Recall from the previous discussion that most persons other than 77373 (Row 14) are difficult to

assign to a segment, since many of them (e.g., those predicted as ‘Early

Adopter’) are calculated as having at least some percentage chance of landing in any one of the four adopter

categories.

Fig. 13-b


• Interpretation: Let’s take the person on Row 1 (ID 56031) as an example.

• Under the Gain Ratio algorithm, we were 41% sure he’d be an early adopter, but almost 32% sure he

might also turn out to be an innovator.

• In other words, we feel confident he’ll buy the eReader early on, but we’re not sure how early.

Gain_Ratio algorithm

Fig. 14-a

• Maybe that matters to Richard, maybe not. However, he will have to decide during the deployment

phase. But perhaps using gini_index, we can help him decide.

Gini_Index algorithm

Fig. 14-b

• In Figure 14-b, this same man is now shown to have a 60% chance of being an early adopter and only a

20% chance of being an innovator. The odds of him becoming part of the late majority crowd under the

Gini model have dropped to zero.

• We know he will adopt (or at least we are predicting with 100% confidence that he will adopt), and that

he will adopt early. Why?

• While he may not be at the top of Richard’s list when deployment rolls around, he’ll probably be higher

than he otherwise would have been under gain_ratio.

• Note that while Gini has changed some of our predictions, it hasn’t affected all of them. Re-check

person ID 77373 briefly. There is no difference in this person’s predictions under either algorithm -

RapidMiner is quite certain in its predictions for this young man.
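
The comparison for person 56031 can be worked through as simple arithmetic. The sketch below uses only the percentages quoted above ("almost 32%" is rounded to 32); the helper names are hypothetical. Since the four confidences add to 100%, whatever is not quoted must belong to the remaining categories.

```python
# Confidence percentages for person 56031 as quoted in the text.
gain_ratio_conf = {"Early Adopter": 41, "Innovator": 32}
gini_conf = {"Early Adopter": 60, "Innovator": 20, "Late Majority": 0}

def early_share(conf):
    """Combined confidence of buying early (Innovator or Early Adopter)."""
    return conf.get("Innovator", 0) + conf.get("Early Adopter", 0)

def unquoted_share(conf):
    """Percentage left over for the categories the text does not quote."""
    return 100 - sum(conf.values())

print(early_share(gain_ratio_conf), early_share(gini_conf))  # 73 80
print(unquoted_share(gini_conf))  # 20, which must belong to Early Majority
```

Under gain_ratio the remaining 27% is shared between Early Majority and Late Majority in proportions the text does not give.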

Phase 6. Deployment – Revisiting

• Richard now has a tree that shows him which attributes matter most in determining the likelihood of

buying for each group.

• New marketing campaigns can use this information to focus more on increasing web site activity level,

or on connecting general electronics that are for sale on the company’s web site with the eReaders and

digital media more specifically.

• These types of cross-categorical promotions can be further honed to appeal to buyers of a specific

gender or in a given age range.

• Richard has much that he can use in this rich data mining output as he works to promote the next-gen

eReader.
