ma mru dm chapter09

Upload: pop-roxana

Post on 15-Oct-2015


  • Chapter 9: Market Basket Analysis and Association Rules

  • 2

    Data Mining Techniques So Far

    Chapter 5 Statistics

    Chapter 6 Decision Trees

    Chapter 7 Neural Networks

    Chapter 8 Nearest Neighbor Approaches: Memory-Based Reasoning and Collaborative Filtering

  • 3

    Questions related to Market Basket

  • 4

    What can be inferred?

    I purchase diapers

    I purchase a new car

    I purchase OTC cough medicine

    I purchase a prescription medication

    I don't show up for class

  • 5

    Market Basket Analysis

    Retail: each customer purchases a different set of products, in different quantities, at different times

    MBA uses this information to:

    Identify who customers are (not by name)

    Understand why they make certain purchases

    Gain insight about its merchandise (products): fast and slow movers, products which are purchased together, products which might benefit from promotion

    Take action: store layouts; which products to put on special, promote, or offer coupons for

    Combined with a customer loyalty card, all of this becomes even more valuable

  • 6

    Association Rules

    DM technique most closely allied with Market Basket Analysis

    AR can be automatically generated

    AR represent patterns in the data without a specified target variable

    Good example of undirected data mining

    Whether patterns make sense is up to humanoids (us!)

  • 7

    Association Rules Apply Elsewhere

    Items purchased on a credit card, such as rental cars and hotel rooms, provide insight into the next product that customers are likely to purchase.

    Optional services purchased by telecommunications customers (call waiting, call forwarding, DSL, speed call, and so on) help determine how to bundle these services together to maximize revenue.

    Banking services used by retail customers (money market accounts, CDs, investment services, car loans, and so on) identify customers likely to want other services.

    Unusual combinations of insurance claims can be a sign of fraud and can spark further investigation.

    Medical patient histories can give indications of likely complications based on certain combinations of treatments.

  • 8

    Market Basket Analysis Drill-Down

    MBA is a set of techniques, Association Rules being most common, that focus on point-of-sale (p-o-s) transaction data

    3 types of market basket data (p-o-s data)

    Customers

    Orders (basic purchase data or baskets or item sets)

    Items (merchandise/services purchased)

  • 9

    Typical Data Structure (Relational Database)

    Lots of questions can be answered

    Avg # of orders/customer

    Avg # unique items/order

    Avg # of items/order

    For a product:

    What % of customers have purchased it?

    Avg # of orders/customer that include it

    Avg quantity of it purchased/order

    Visualization is extremely helpful (next slide)

    Transaction Data
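
    The summary questions on this slide can be sketched in a few lines of Python. This is only an illustration: the `order_lines` rows, the customer and order IDs, and the helper name `pct_customers_buying` are all made up for the example and are not part of the slides.

```python
from collections import defaultdict

# Hypothetical order lines: (customer_id, order_id, item, quantity).
order_lines = [
    ("c1", "o1", "beer", 2),
    ("c1", "o1", "diaper", 1),
    ("c1", "o2", "beer", 1),
    ("c2", "o3", "milk", 3),
]

orders_by_customer = defaultdict(set)  # customer -> set of order ids
items_by_order = defaultdict(set)      # order -> set of distinct items
units_by_order = defaultdict(int)      # order -> total quantity purchased

for customer, order, item, qty in order_lines:
    orders_by_customer[customer].add(order)
    items_by_order[order].add(item)
    units_by_order[order] += qty

n_customers = len(orders_by_customer)
n_orders = len(items_by_order)

avg_orders_per_customer = n_orders / n_customers
avg_unique_items_per_order = sum(len(s) for s in items_by_order.values()) / n_orders
avg_items_per_order = sum(units_by_order.values()) / n_orders


def pct_customers_buying(item):
    """Percentage of customers who purchased `item` at least once."""
    buyers = {c for c, _, i, _ in order_lines if i == item}
    return 100 * len(buyers) / n_customers


print(avg_orders_per_customer, pct_customers_buying("beer"))
```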

  • 10

    Combining data

    These measures give broad insight into the business.

    In some cases, there are few repeat customers, so the number of orders per customer is close to 1.

    This suggests a business opportunity to increase the number of sales per customer.

    Or, the number of products per order may be close to 1, suggesting an opportunity for cross-selling during the process of making an order.

    It can be useful to compare these measures to each other.

  • 11

    Questions about ...

    Sales Order Characteristics

    Item Popularity

    Tracking Marketing Interventions

    Clustering Products by Usage

  • 12

    Sales Order Characteristics

    Customer purchases have additional interesting characteristics.

    For instance, the average order size varies by time and region

    For Web purchases and mail-order transactions, additional information may also be gathered at the point of sale:

    Did the order use gift wrap?

    Is the order going to the same address as the billing address?

    Did the purchaser accept or decline a particular cross-sell offer?

  • 13

    Item popularity

    What is the most common item found on a one-item order?

    What is the most common item found on a multi-item order?

    What is the most common item for repeat customer purchases?

    How has ordering of an item changed over time?

    How does the ordering of an item vary geographically?

  • 14

    Tracking Marketing Interventions

    Including marketing interventions along with the product sales over time makes it possible to see the effect of the interventions.

    Prior to the intervention, sales are hovering at 50 units / week.

    After the intervention, they peak at 7-8 times that amount.

    A challenge in answering this question is determining whether the additional sales are incremental or are made by customers who would purchase the product anyway at some later time.

    We can also look at the number of baskets containing the item.

    If the number of customers is not increasing, there is evidence that existing customers are simply stocking up on the item at a lower cost.

  • 15

    Clustering Products by Usage

    What groups of products often appear together?

    Such groups of products are very useful for making recommendations to customers: customers who have purchased some of the products may be interested in the rest of them.

    A lot of information is available about products.

    In addition to the product hierarchy, such information includes the color of clothes, whether food is low calorie, whether a poster includes a frame, and so on

    Questions: Do diet products tend to sell together?

    Are customers purchasing similar colors of clothing at the same time?

    Do customers who purchase framed posters also buy other products?

  • 16

    Pivoting for Cluster Algorithms

  • 17

    Association Rules

    Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars

    Customers who purchase maintenance agreements are very likely to purchase large appliances

    When a new hardware store opens, one of the most commonly sold items is toilet bowl cleaners

    So what?

  • 18

    Famous Rules: Beer & Diapers

  • 19

    Famous Rules: Beer & Diapers

    WHY?

    Beer drinkers do not want to interrupt their enjoyment of televised sports, so they buy diapers to reduce trips to the bathroom. No, that's not it.

    Families with young children are preparing for the weekend.

    What can a retailer do with this information?

    Put the beer and diapers close together, so when one is purchased, customers remember to buy the other one.

    Put them as far apart as possible, so customers have the opportunity to buy yet more items on the way.

    Put higher-margin diapers a bit closer to the beer, although mixing baby products and alcohol would probably be unseemly.

  • 20

    Association Rules

    Examples:

    If buy {Diaper} Then buy {Beer}

    Shoppers who buy Diaper are very likely to buy Beer.

    If buy {Beer, Diaper} Then buy {Cheese, Chocolate}

    Shoppers who buy Beer and Diaper are likely to buy Cheese and Chocolate.

    For a frequent itemset {Diaper, Beer}, is Diaper promoting the purchase of Beer, or Beer increasing the chance of Diaper purchase?

    We need directions.

  • 21

    Association Rules

    Rule format:

    If {set of items} Then {set of items}

    LHS implies RHS *

    Example: If {Diaper, Baby Food} Then {Beer, Wine}

    Here {Diaper, Baby Food} is the LHS and {Beer, Wine} is the RHS

    An association rule is valid if it satisfies some evaluation measures

    * LHS = "Left Hand Side", RHS = "Right Hand Side"

  • 22

    Association Rules

    Association rule types:

    Actionable Rules: contain high-quality, actionable information

    Trivial Rules: information already well-known by those familiar with the business

    Results from market basket analysis may simply be measuring the success of previous marketing campaigns

    Inexplicable Rules: no explanation and do not suggest action

    Trivial and Inexplicable Rules occur most often

  • 23

    Rule Evaluation

    Transaction No.   Item 1      Item 2      Item 3
    100               Beer        Diaper      Chocolate
    101               Milk        Chocolate   Wine
    102               Beer        Wine        Vodka
    103               Beer        Cheese      Diaper
    104               Ice Cream   Diaper      Beer
    ...

    Milk & Wine co-occur, but only 2 out of 200K transactions contain these items

  • 24

    Rule Evaluation - Support

    Support: the frequency with which the items in LHS and RHS co-occur.

    Support = (No. of transactions containing items in LHS and RHS) / (Total no. of transactions in the dataset)

    E.g., the support of the {Diaper} → {Beer} rule is 3/5: 60% of the transactions contain both items.

    Transaction No.   Item 1      Item 2      Item 3
    100               Beer        Diaper      Chocolate
    101               Milk        Chocolate   Shampoo
    102               Beer        Wine        Vodka
    103               Beer        Cheese      Diaper
    104               Ice Cream   Diaper      Beer
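
    As a quick check of the support calculation, here is a minimal Python sketch over the five transactions from the slide. The function name `support` and the representation of each transaction as a set are choices made for this illustration.

```python
# The five transactions from the slide, one set of items per transaction.
transactions = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Wine", "Vodka"},
    {"Beer", "Cheese", "Diaper"},
    {"Ice Cream", "Diaper", "Beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


# Support of the rule {Diaper} -> {Beer}: both items must be present.
print(support({"Diaper", "Beer"}, transactions))  # 0.6
```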

  • 25

    Rule Evaluation - Confidence

    Is Beer leading to Diaper purchase or Diaper leading to Beer purchase?

    Among the transactions with Diaper, 100% have Beer. P(Beer|Diaper) = 100%

    Among the transactions with Beer, 75% have Diaper. P(Diaper|Beer) = 75%

    Confidence = (No. of transactions containing both LHS and RHS) / (No. of transactions containing LHS)

    Transaction No.   Item 1      Item 2      Item 3
    100               Beer        Diaper      Chocolate
    101               Milk        Chocolate   Shampoo
    102               Beer        Wine        Vodka
    103               Beer        Cheese      Diaper
    104               Ice Cream   Diaper      Beer

    Confidence for {Diaper} → {Beer}: 3/3

    When Diaper is purchased, the likelihood of Beer purchase is 100%

    Confidence for {Beer} → {Diaper}: 3/4

    When Beer is purchased, the likelihood of Diaper purchase is 75%

    So, {Diaper} → {Beer} is a more important rule according to confidence.
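
    The two confidence values can be reproduced with a short Python sketch. It reuses the slide's five transactions; the function name `confidence` and its argument order (LHS first, then RHS) are assumptions of this example.

```python
# The five transactions from the slide, one set of items per transaction.
transactions = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Wine", "Vodka"},
    {"Beer", "Cheese", "Diaper"},
    {"Ice Cream", "Diaper", "Beer"},
]


def confidence(lhs, rhs, transactions):
    """Share of the transactions containing LHS that also contain RHS."""
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    return n_both / n_lhs


print(confidence({"Diaper"}, {"Beer"}, transactions))  # 1.0
print(confidence({"Beer"}, {"Diaper"}, transactions))  # 0.75
```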

  • 26

    Rule Evaluation - Lift

    Transaction No.   Item 1   Item 2      Item 3      Item 4
    100               Beer     Diaper      Chocolate
    101               Milk     Chocolate   Shampoo
    102               Beer     Milk        Vodka       Chocolate
    103               Beer     Milk        Diaper      Chocolate
    104               Milk     Diaper      Beer

    What's the support and confidence for rule {Chocolate} → {Milk}?

    Support = 3/5, Confidence = 3/4

    Very high support and confidence. Does Chocolate really lead to Milk purchase?

    No! Because Milk occurs in 4 out of 5 transactions. Chocolate is even decreasing the chance of Milk purchase: 3/4 < 4/5, i.e. P(Milk|Chocolate) < P(Milk)

    When lift > 1, the rule is better at predicting the result than guessing

    When lift < 1, the rule is doing worse than informed guessing, and using the Negative Rule produces a better rule than guessing

  • 27

    Rule Evaluation - Lift (cont.)

    Measures how much more likely the RHS is given the LHS than the RHS is on its own

    Lift = confidence of the rule / probability of the RHS

    i.e. = P(RHS|LHS)/P(RHS)

    Example: {Diaper} → {Beer}

    Total number of customers in database: 1000

    No. of customers buying Diaper: 200

    No. of customers buying beer: 50

    No. of customers buying Diaper & beer: 20

    Probability of Beer = 50/1000 (5%)

    Confidence = 20/200 (10%)

    Lift = 10%/5% = 2

    Lift higher than 1 implies people have a higher chance to buy Beer when they buy Diaper. Lift lower than 1 implies people have a lower chance to buy Milk when they buy Chocolate.
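
    Both lift examples can be checked directly from the counts. The sketch below is only an illustration; the function name `lift_from_counts` and its parameter names are invented here, while the counts come from the Diaper/Beer example and from the Chocolate/Milk table on the previous slide.

```python
def lift_from_counts(n_total, n_lhs, n_rhs, n_both):
    """Lift = P(RHS | LHS) / P(RHS), computed from raw counts."""
    confidence = n_both / n_lhs  # P(RHS | LHS)
    p_rhs = n_rhs / n_total      # P(RHS)
    return confidence / p_rhs


# {Diaper} -> {Beer}: 1000 customers, 200 buy Diaper, 50 buy Beer, 20 buy both.
diaper_beer = lift_from_counts(1000, 200, 50, 20)

# {Chocolate} -> {Milk}: 5 transactions, 4 with Chocolate, 4 with Milk, 3 with both.
chocolate_milk = lift_from_counts(5, 4, 4, 3)

print(diaper_beer)     # about 2: Diaper buyers are twice as likely to buy Beer
print(chocolate_milk)  # below 1: Chocolate buyers are slightly less likely to buy Milk
```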

  • 28

    Rule Evaluation - Practical Impact

    Most methods for extracting association rules find too many trivial rules. Most are either obvious or uninteresting.

    Example: If Maternity Ward then patient is a woman. Confidence 100%, support 100%

    Need to screen for rules that are of particular interest and significance.

    Actionable: Keep only rules that can be acted upon.

    Interestingness: Various measures for how surprising or unexpected a rule is.

    Example: A rule is interesting if it contradicts what is currently known (e.g., it contradicts a rule that was previously discovered).

  • 29

    Creating Association Rules

    1. Choosing the right set of items

    2. Generating rules by deciphering the counts in the co-occurrence matrix

    3. Overcoming the practical limits imposed by thousands or tens of thousands of unique items

  • 30

    Creating Association Rules

  • 31

    Creating Association Rules

    Choosing the right set of items

    Within a grocery store where there are tens of thousands of products on the shelves, a frozen pizza might be considered an item for analysis purposes, regardless of its toppings (extra cheese, pepperoni, or mushrooms), its crust (extra thick, whole wheat, or white), or its size.

    On the other hand, the manager of frozen foods or a chain of pizza restaurants may be very interested in the particular combinations of toppings that are ordered.

  • 32

    Creating Association Rules

    Choosing the right set of items

    What level of the product hierarchy is the right one to use?

    Market basket analysis produces the best results when the items occur in roughly the same number of transactions in the data. This helps prevent rules from being dominated by the most common items. Product hierarchies can help here. Roll up rare items to higher levels in the hierarchy, so they become more frequent. More common items may not have to be rolled up at all.

  • 33

    Creating Association Rules

    Generating rules by deciphering the counts in the co-occurrence matrix

    if condition, then result.

    if Barbie doll, then candy bar

    = if a customer purchases a Barbie doll, then the customer is also expected to purchase a candy bar.

    Saying that the rule if B and C then A has a confidence of 0.33 is equivalent to saying that when B and C appear in a transaction, there is a 33 percent chance that A also appears in it.

  • 34

    Creating Association Rules

    Overcoming the practical limits imposed by thousands or tens of thousands of unique items

    1. Generate co-occurrence matrix for single items: if OJ then soda

    2. Generate co-occurrence matrix for two items: if OJ and Milk then soda

    3. Generate co-occurrence matrix for three items: if OJ and Milk and Window Cleaner then soda

    4. And so on
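
    The counting behind these co-occurrence matrices can be sketched as follows. The baskets are hypothetical (only the item names OJ, Milk, Window Cleaner and soda appear on the slide), and `cooccurrence_counts` is a name chosen for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented for this illustration.
baskets = [
    {"OJ", "Soda"},
    {"Milk", "OJ", "Window Cleaner"},
    {"OJ", "Detergent"},
    {"OJ", "Detergent", "Soda"},
    {"Window Cleaner", "Soda"},
]


def cooccurrence_counts(baskets, size):
    """Count in how many baskets each combination of `size` items appears."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every combination one canonical tuple key.
        for combo in combinations(sorted(basket), size):
            counts[combo] += 1
    return counts


singles = cooccurrence_counts(baskets, 1)
pairs = cooccurrence_counts(baskets, 2)

# Confidence of "if OJ then Soda" read off the two tables.
conf_oj_soda = pairs[("OJ", "Soda")] / singles[("OJ",)]
print(conf_oj_soda)  # 0.5
```

    Each further step (three items, four items, ...) repeats the same count with a larger `size`, which is why the number of combinations becomes the practical limit.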

  • 35

    Algorithm to Extract Association Rules

    The standard algorithm: Apriori

    Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499

    The Association Rules problem was defined as:

    Generate all association rules that have

    support greater than the user-specified minimum support

    and confidence greater than the user-specified minimum confidence

    The base algorithm uses support and confidence, but we can also use lift to rank the rules discovered by Apriori.

    The algorithm performs an efficient search over the data to find all such rules.

  • 36

    Finding Association Rules from Data

    Association rules discovery problem is decomposed into two sub-problems:

    1. Find all sets of items (itemsets) whose support is above minimum support - called frequent itemsets or large itemsets

    2. From each frequent itemset, generate rules whose confidence is above minimum confidence.

    Given a large itemset Y and a subset X of Y, calculate the confidence of the rule X → (Y - X). If its confidence is above the minimum confidence, then X → (Y - X) is an association rule we are looking for.

  • 37

    Example

    A data set with 5 transactions

    Minimum support = 40%, Minimum confidence = 80%

    Transaction No.   Item 1      Item 2      Item 3
    100               Beer        Diaper      Chocolate
    101               Milk        Chocolate   Shampoo
    102               Beer        Wine        Vodka
    103               Beer        Cheese      Diaper
    104               Ice Cream   Diaper      Beer

    Phase 1: Find all frequent itemsets

    {Beer} (support = 80%),

    {Diaper} (60%),

    {Chocolate} (40%),

    {Beer, Diaper} (60%)

    Phase 2: Generate rules from the frequent itemsets

    Beer → Diaper (conf. 3/4 = 75%)

    Diaper → Beer (conf. 3/3 = 100%)

    Only Diaper → Beer meets the 80% minimum confidence.

  • 38

    Phase 1: Finding all frequent itemsets

    How to perform an efficient search of all frequent itemsets?

    A naive way is to calculate the support for every possible itemset. There are 2^N possible itemsets given N items, so this is impossible to do!

    Need a smart method: frequent itemsets of size n contain itemsets of size n-1 that must also be frequent

    Example: if {diaper, beer} is frequent then {diaper} and {beer} are each frequent as well

    This means that if an itemset is not frequent (e.g., {wine}) then no itemset that includes wine can be frequent either, such as {wine, beer}.

    We therefore first find all itemsets of size 1 that are frequent.

    Then try to expand these by counting the frequency of all itemsets of size 2 that include frequent itemsets of size 1.

    Example:

    If {wine} is not frequent we need not try to find out whether {wine, beer} is frequent. But if both {wine} & {beer} were frequent then it is possible (though not guaranteed) that {wine, beer} is also frequent.

    Then take only itemsets of size 2 that are frequent, and try to expand those, etc.

  • 39

    Phase 2: Generating Association Rules

    Assume {Milk, Bread, Butter} is a frequent itemset.

    Using items contained in the itemset, list all possible rules:

    {Milk} → {Bread, Butter}
    {Bread} → {Milk, Butter}
    {Butter} → {Milk, Bread}
    {Milk, Bread} → {Butter}
    {Milk, Butter} → {Bread}
    {Bread, Butter} → {Milk}

    Calculate the confidence of each rule

    Pick the rules with confidence above the minimum confidence

    Confidence of {Milk} → {Bread, Butter}:

    = Support {Milk, Bread, Butter} / Support {Milk}

    = (No. of transactions that support {Milk, Bread, Butter}) / (No. of transactions that support {Milk})
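
    Phase 2 can be sketched in Python as below. Because the slides give no transactions for {Milk, Bread, Butter}, this example runs on the five Beer/Diaper transactions from the earlier example slide instead; the helper names `count` and `rules_from_itemset` are assumptions of this illustration.

```python
from itertools import combinations

# The five transactions from the earlier Beer/Diaper example slide.
transactions = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Wine", "Vodka"},
    {"Beer", "Cheese", "Diaper"},
    {"Ice Cream", "Diaper", "Beer"},
]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from_itemset(itemset, min_confidence):
    """List every rule LHS -> (itemset - LHS) that reaches `min_confidence`."""
    itemset = frozenset(itemset)
    rules = []
    # Every non-empty proper subset of the itemset is a possible LHS.
    for k in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), k):
            lhs = frozenset(lhs)
            conf = count(itemset) / count(lhs)
            if conf >= min_confidence:
                rules.append((set(lhs), set(itemset - lhs), conf))
    return rules


rules = rules_from_itemset({"Beer", "Diaper"}, min_confidence=0.8)
print(rules)  # [({'Diaper'}, {'Beer'}, 1.0)]
```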

  • 40

    Agrawal (94)'s Apriori Algorithm - An Example

    (C = candidate itemsets, L = frequent itemsets)

    Transactions

    T-ID   Items
    10     A, C, D
    20     B, C, E
    30     A, B, C, E
    40     B, E

    1st scan: C1

    Itemset   sup
    {A}       2
    {B}       3
    {C}       3
    {D}       1
    {E}       3

    L1

    Itemset   sup
    {A}       2
    {B}       3
    {C}       3
    {E}       3

    C2

    Itemset
    {A, B}
    {A, C}
    {A, E}
    {B, C}
    {B, E}
    {C, E}

    2nd scan: C2

    Itemset   sup
    {A, B}    1
    {A, C}    2
    {A, E}    1
    {B, C}    2
    {B, E}    3
    {C, E}    2

    L2

    Itemset   sup
    {A, C}    2
    {B, C}    2
    {B, E}    3
    {C, E}    2

    C3

    Itemset
    {B, C, E}

    3rd scan: L3

    Itemset     sup
    {B, C, E}   2

    {A, B, C}, {A, C, E}?
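
    The level-wise search in this example can be sketched in Python. This is a compact illustration, not the paper's optimized implementation: the function name `apriori` is chosen here, and the minimum support count of 2 is an assumption inferred from {D} (support 1) being dropped between C1 and L1.

```python
from itertools import combinations

# The four transactions from the slide.
transactions = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]
MIN_COUNT = 2  # assumed threshold; it is not stated explicitly on the slide


def apriori(transactions, min_count):
    """Return {frequent itemset: support count}, built one size at a time."""
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]  # C1
    frequent = {}
    k = 1
    while candidates:
        # One scan of the data counts every candidate of the current size.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}  # Lk
        frequent.update(level)
        # Join Lk with itself to form C(k+1), then prune any candidate
        # that has an infrequent k-item subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


result = apriori(transactions, MIN_COUNT)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

    The pruning step is what answers the question on the slide: {A, B, C} and {A, C, E} never become candidates because {A, B} and {A, E} are not frequent.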

  • 41

    The number of combinations with n items is proportional to the number of items raised to the nth power - a number that gets very large, very fast.

  • 42

    Final Thought on Association Rules: The Problem of Lots of Data

    Fast food restaurant: could have 100 items on its menu

    How many combinations are there with 3 different menu items? 161,700!

    Supermarket: 10,000 or more unique items

    About 50 million 2-item combinations

    More than 100 billion 3-item combinations

    How to reduce data:

    Use of product hierarchies (groupings)

    Pruning: reducing the number of items and combinations of items being considered at each step

    Minimum support pruning requires that a rule hold on a minimum number of transactions.

    If there are one million transactions and the minimum support is 1%, then only rules supported by at least 10,000 transactions are of interest.

    Finally, know that the number of transactions in a given time period could also be huge (hence expensive to analyze)
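
    The combination counts quoted on this slide can be checked with Python's standard `math.comb`. The variable names below are made up for the example.

```python
from math import comb

# Fast food menu: 3-item combinations out of 100 items.
menu_triples = comb(100, 3)

# Supermarket: 2-item and 3-item combinations out of 10,000 items.
store_pairs = comb(10_000, 2)
store_triples = comb(10_000, 3)

# Minimum support pruning: 1% of one million transactions.
min_transactions = 1_000_000 // 100

print(menu_triples)      # 161700
print(store_pairs)       # 49995000
print(store_triples)     # 166616670000
print(min_transactions)  # 10000
```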

  • 43

    Using Association Rules to Compare Stores

    EX: compare sales at store openings versus existing stores:

    1. Gather data for a specific period (such as 2 weeks) from store openings. Augment each of the transactions in this data with a virtual item saying that the transaction is from a store opening.

    2. Gather about the same amount of data from existing stores. Here you might use a sample across all existing stores, or you might take all the data from stores in comparable locations. Augment the transactions in this data with a virtual item saying that the transaction is from an existing store.

    3. Apply market basket analysis to find association rules in each set.

    4. Pay particular attention to association rules containing the virtual items.
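
    Steps 1 and 2 amount to adding one extra item to every basket. A minimal sketch, assuming hypothetical baskets and made-up virtual item labels (`STORE_OPENING`, `EXISTING_STORE`):

```python
def tag_with_virtual_item(transactions, virtual_item):
    """Return copies of the transactions with `virtual_item` added to each."""
    return [set(t) | {virtual_item} for t in transactions]


# Hypothetical baskets, invented for this illustration.
opening_baskets = [{"Toilet Bowl Cleaner", "Paint"}, {"Toilet Bowl Cleaner"}]
existing_baskets = [{"Paint", "Brush"}, {"Nails"}]

opening_tagged = tag_with_virtual_item(opening_baskets, "STORE_OPENING")
existing_tagged = tag_with_virtual_item(existing_baskets, "EXISTING_STORE")

# Share of store-opening baskets that contain the item of interest,
# i.e. the confidence of "if STORE_OPENING then Toilet Bowl Cleaner".
with_cleaner = sum(1 for t in opening_tagged if "Toilet Bowl Cleaner" in t)
print(with_cleaner / len(opening_tagged))  # 1.0
```

    Any rule-mining routine can then be run on the tagged baskets unchanged; rules that mention a virtual item describe what is distinctive about that group of stores.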

  • 44

    Dissociation Rules

    if A and not B, then C

    Dissociation rules can be generated by a simple adaptation of the basic market basket analysis algorithm.

    Downsides to including new items:

    doubling the number of items seriously degrades performance

    the size of a typical transaction grows because it now includes inverted items

    the frequency of the inverse items tends to be much larger than the frequency of the original items.

    So, minimum support constraints tend to produce rules in which all items are inverted, such as if NOT A and NOT B then NOT C.

    These rules are less likely to be actionable.
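
    The adaptation, and the downsides listed above, can be seen in a small sketch. The transactions are hypothetical, and the `NOT x` naming for inverted items and the function name `add_inverted_items` are assumptions of this example.

```python
def add_inverted_items(transactions, catalog=None):
    """Add a 'NOT x' item for every catalog item missing from a transaction."""
    if catalog is None:
        catalog = set().union(*transactions)
    return [set(t) | {f"NOT {i}" for i in set(catalog) - set(t)} for t in transactions]


# Hypothetical transactions over a catalog of four items.
transactions = [{"A", "C"}, {"A"}, {"B"}, {"A", "D"}]
augmented = add_inverted_items(transactions)

# Every transaction now has exactly one entry per catalog item.
print([len(t) for t in augmented])  # [4, 4, 4, 4]

# Inverted items are far more frequent than the originals.
print(sum("B" in t for t in augmented), sum("NOT B" in t for t in augmented))  # 1 3
```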

  • 45

    Sequential Analysis Using Association Rules

    Association rules find things that happen at the same time: what items are purchased at a given time.

    The next natural question concerns sequences of events and what they mean. Examples:

    New homeowners purchase shower curtains before purchasing furniture.

    Customers who purchase new lawnmowers are very likely to purchase a new garden hose in the following 6 weeks.

    When a customer goes into a bank branch and asks for an account reconciliation, there is a good chance that he or she will close all his or her accounts.

    In order to consider time-series analyses on your customers, there has to be some way of identifying customers. Without a way of tracking individual customers, there is no way to analyze their behavior over time.

  • 46

    Sequential Patterns

    Instead of finding associations between items in a single transaction, find associations between items across related transactions over time.

    Customer ID   Transaction Date   Item 1                  Item 2
    AA            2/2/2001           Laptop                  Case
    AA            1/13/2002          Wireless network card   Router
    BB            4/5/2002           Laptop                  iPaq
    BB            8/10/2002          Wireless network card   Router

    Sequence: {Laptop}, {Wireless Card, Router}

    A sequence has to satisfy some predetermined minimum support
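
    The support of a sequence can be sketched as the share of customers whose history contains it in order. The code below uses the table from the slide, reading the dates as month/day/year (an assumption); the function names are made up for this illustration.

```python
from datetime import date

# Purchase history per customer, taken from the slide's table.
history = {
    "AA": [
        (date(2001, 2, 2), {"Laptop", "Case"}),
        (date(2002, 1, 13), {"Wireless network card", "Router"}),
    ],
    "BB": [
        (date(2002, 4, 5), {"Laptop", "iPaq"}),
        (date(2002, 8, 10), {"Wireless network card", "Router"}),
    ],
}


def contains_sequence(transactions, sequence):
    """True if each itemset in `sequence` occurs, in order, in successive transactions."""
    step = 0
    for _, items in sorted(transactions, key=lambda t: t[0]):
        if step < len(sequence) and sequence[step] <= items:
            step += 1
    return step == len(sequence)


def sequence_support(history, sequence):
    """Fraction of customers whose history contains the sequence."""
    hits = sum(contains_sequence(tx, sequence) for tx in history.values())
    return hits / len(history)


laptop_then_network = [{"Laptop"}, {"Wireless network card", "Router"}]
print(sequence_support(history, laptop_then_network))  # 1.0
```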

  • 47

    Exercise 1 by hand

    Given the list of transactions below, do the following:

    1) Find all the frequent itemsets (minimum support 40%)

    2) Find all the association rules (minimum confidence 70%)

    3) For the discovered association rules, calculate the lift

    Transaction No.   Item 1   Item 2      Item 3      Item 4
    100               Beer     Diaper      Chocolate
    101               Milk     Chocolate   Shampoo
    102               Beer     Soap        Vodka
    103               Beer     Cheese      Wine
    104               Milk     Diaper      Beer        Chocolate

  • 48

    RapidMiner Practice

    To see:

    RapidMiner Tutorial example 2 / 26

    To practice:

    Do the exercise presented in the tutorial using the file Iris.ioo.

  • 49

    Exercise 1 using RapidMiner

    Take Beer.xls file and find the association rules

    First process the data to the right format (Beer1.xls)

  • 50

    RapidMiner Practice

    To see:

    Training Videos\05 - Akhtar Fareed -RapidMinerTutorial\RapidMiner Tutorial (part 9_9) Association Rules

    To practice:

    Do the exercises presented in the movie using the file BalanceScale.xls.

  • 51

    RapidMiner Practice: Association Rules

    Data Preprocessing

    Bank.xls → Bank.ioo

    Save as .ioo format

    Process design

    Take a look at the .ioo file and attributes / variables

    Process the attributes using Select Attributes

    Rules can only handle categorical data types

    Find association rules

    Use operators: FP-Growth, then Create Association Rules

    Read and interpret the results