knowledge engineering report: apriori algorithm


Hanoi University of Science and Technology
School of Information and Communication Technology

==================*=================

Knowledge Engineering Report
Subject: Determination of products bought together

Supervisor: Ph.D. Quang Nhat NGUYEN
Group 20:
- T Quc Vit (20093262)
- Ng Ngc Thnh (20092421)

Hanoi Dec 2013



I. Problem Definition

For a vendor, knowing which products are frequently purchased along with or after other products is very helpful for the business. By exploiting the transaction history, the vendor can obtain knowledge about the purchase behavior of its customers. The benefit does not stay with the vendor only: it also helps customers buy the sets of products that are relevant to their needs.

Figure 1 Purposes & Benefits

Association Rule Mining is an influential method for exploiting raw transaction data. Within the scope of this report, we introduce the basic analysis of this method.

II. Basic Concepts

Figure 2 Overview

II.1. Data Mining

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases [1]. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use [2].

II.2. Association Rule Mining

There are several methods for solving data mining problems; one common approach is Association Rule Mining. By definition, frequent pattern mining searches for recurring relationships in a given data set [3]. Its output is the discovery of associations and correlations among items in large transactional or relational data sets. These correlations can support business decision making, such as catalog design, cross-marketing and, especially, customer shopping behavior analysis. In this section we give a detailed explanation and analysis of Association Rule Mining.

II.2.1. Basic Terms

To describe Association Rule Mining, it is necessary to understand several terms and be familiar with some notation:

- $I = \{i_1, i_2, \dots, i_m\}$ is the set of all possible items (products to be bought).
- $D$ is the set of transactions in the database, where each transaction $T$ is a set of items with $T \subseteq I$. We assign an identifier TID to each transaction.
- Let $P$ be a set of items. A transaction $T$ contains $P$ if $P \subseteq T$.
- An association rule is an implication $P \Rightarrow Q$, where $P \subset I$, $Q \subset I$ and $P \cap Q = \emptyset$. In other words, it is a relationship between two disjoint itemsets $P$ and $Q$ in $I$, which implies that if $P$ occurs in a transaction $T$, then $Q$ also occurs in $T$ with a certain probability.
- The rule $P \Rightarrow Q$ holds in the transaction set $D$ with support $s$, where $s$ is the percentage of transactions in $D$ that contain $P \cup Q$; this value is the probability $\Pr(P \cup Q)$.
- The rule $P \Rightarrow Q$ has confidence $c$ in the transaction set $D$, where $c$ is the percentage of transactions in $D$ containing $P$ that also contain $Q$. That is, it is the conditional probability $\Pr(Q \mid P)$. This value can also be computed from the conditional probability property:

  $$\mathrm{confidence}(P \Rightarrow Q) = \Pr(Q \mid P) = \frac{\mathrm{support}(P \cup Q)}{\mathrm{support}(P)}$$

  This equation makes it much easier to compute the confidence value.
- We have to define two thresholds: minimum support (min_sup) and minimum confidence (min_conf). By convention, these two values are written as percentages from 0% to 100% rather than as fractions from 0 to 1.0.
- Rules that satisfy both of the above thresholds are called strong rules.
  o The confidence of the rule $P \Rightarrow Q$ can be easily derived from the support counts of $P$ and $P \cup Q$. These two values are easy to obtain, which reduces the computing time of the process.
- A set of k items is called a k-itemset.
- The set of frequent k-itemsets, i.e. the k-itemsets that satisfy minimum support, is denoted by Lk. Ck is the set of candidate k-itemsets generated by joining Lk-1 with itself.

Those are all the concepts the algorithm relies on. In the next sections, we will see how the algorithm is built up.
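As a minimal sketch of the two measures defined above, the following Python snippet computes support and confidence over a small in-memory list of transactions. The function names `support` and `confidence` and the toy data are assumptions made for illustration; they are not taken from the report's implementation, and the sketch assumes that the left-hand side of the rule occurs in at least one transaction.

```python
def support(itemset, transactions):
    """Percentage of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return 100.0 * hits / len(transactions)


def confidence(p, q, transactions):
    """confidence(P => Q) = support(P u Q) / support(P), as a percentage."""
    return 100.0 * support(p | q, transactions) / support(p, transactions)


# Hypothetical toy database of four transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
p, q = frozenset({"bread"}), frozenset({"milk"})
print(support(p | q, transactions))     # 2 of 4 transactions contain both items
print(confidence(p, q, transactions))   # 2 of the 3 "bread" transactions contain "milk"
```

Both values are returned as percentages, matching the 0%-100% convention used for min_sup and min_conf.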

II.2.2. Base Theory

Generally, Association Rule Mining has the two main steps below:

Step 1: Find all frequent itemsets. Each of these itemsets occurs at least as frequently as a predetermined minimum support count, min_sup.

Step 2: Generate strong association rules from the frequent itemsets. These rules must satisfy minimum support and minimum confidence.

The figure below gives an idea of how min_sup and min_conf work in Association Rule Mining:

Figure 3 Work flow of the process

In Step 1, we apply the Apriori algorithm to find the frequent itemsets; this is the most important component of the application. Step 2 generates the association rules from the frequent itemsets obtained in Step 1.

The skeleton of the process is therefore not very complicated. So far we have covered the main idea of the Association Rule Mining process. The next two sections analyze the two steps in detail.

II.2.3. Apriori Algorithm

Apriori¹ is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules [4].

¹ The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.

As a reminder, recall the two notations Lk and Ck from the Basic Terms section before exploring the algorithm.

This is the pseudo code of the Apriori algorithm:




List 1 Apriori pseudo code

    L1 = {frequent 1-itemsets};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction T in database do
            increment the count of all candidates in Ck+1 that are contained in T
        Lk+1 = candidates in Ck+1 with min_sup
    end
    return ∪k Lk;

The idea is sound, but a naive implementation faces a serious performance problem. The candidate generation step can produce many candidates that are not actually necessary; generating them in a trivial way yields an enormous number of candidates and consumes a huge amount of computing resources.

This is where the Apriori property comes in: all nonempty subsets of a frequent itemset must also be frequent. This property is based on the observation that if an item A is added to an itemset l that is not frequent, the resulting itemset $l \cup \{A\}$ cannot occur more frequently than l. Hence, $l \cup \{A\}$ is not frequent either. This property belongs to a special category of properties called antimonotonicity, in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well. It is called antimonotonicity because the property is monotonic in the context of failing a test.

We use this property in the prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Therefore, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, the candidate cannot be frequent either and can be removed from Ck.

II.2.4. Rule Generation

After Step 1 with the Apriori algorithm, little work remains. From the frequent itemsets, we can generate strong association rules as follows:

List 2 Rule Generation pseudo code

    for each frequent itemset l, generate all nonempty subsets of l.
    for each nonempty subset s of l,
        output the rule s ⇒ (l − s) if support_count(l) / support_count(s) ≥ min_conf.

Because the rules are generated from frequent itemsets, each one automatically satisfies min_sup. This gives us the association rules we need.
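The two steps described in this section, frequent itemset mining with the prune step followed by rule generation, can be sketched end to end in Python. This is an illustrative sketch under stated assumptions, not the report's implementation: the function names `apriori` and `generate_rules`, the in-memory list of transactions and the toy thresholds are all hypothetical, and only proper subsets are used as left-hand sides so that neither side of a rule is empty.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Step 1: return {frozenset: support_count} for all frequent itemsets.

    min_sup is a percentage (0-100), following the report's convention.
    """
    n = len(transactions)

    def frequent_only(counts):
        # Keep itemsets whose support percentage reaches min_sup.
        return {s: c for s, c in counts.items() if c * 100 >= min_sup * n}

    # L1: count every single item in one pass over the database.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = frequent_only(counts)          # Lk, starting with k = 1
    frequent = dict(current)

    k = 1
    while current:
        # Join Lk with itself, then prune candidates that have an
        # infrequent k-subset (the Apriori property).
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in current for sub in combinations(union, k)
            ):
                candidates.add(union)
        # Count the surviving candidates Ck+1 against every transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = frequent_only(counts)      # Lk+1
        frequent.update(current)
        k += 1
    return frequent


def generate_rules(frequent, min_conf):
    """Step 2: return (lhs, rhs, confidence%) for every strong rule."""
    rules = []
    for itemset, count in frequent.items():
        # Proper nonempty subsets only, so neither side of a rule is empty.
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, size)):
                conf = 100.0 * count / frequent[lhs]
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules


# Hypothetical toy database; thresholds are chosen only for illustration.
toy = [frozenset(t) for t in ([1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 3])]
frequent = apriori(toy, min_sup=60)
for lhs, rhs, conf in generate_rules(frequent, min_conf=70):
    print(sorted(lhs), "=>", sorted(rhs), f"confidence: {conf:.1f}")
```

The lookup `frequent[lhs]` relies on the Apriori property: every nonempty subset of a frequent itemset is itself frequent, so its support count is already in the dictionary and no extra pass over the database is needed to compute confidence.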





III. Solution

III.1. System Architecture

Figure 4 System architecture

The system works as follows:
- Accept three input parameters:
  o A customer transaction history file
  o Min support
  o Min confidence
- Using the customer transaction history file and min support, apply the Apriori algorithm to find all frequent item sets that satisfy min support.
- Using the set of all frequent item sets and min confidence as input, apply the rule generation process to generate all association rules, which are used to determine the products that are bought together.

III.2. Knowledge Representation

The dataset we use is a transaction dataset from a Belgian store, downloaded from http://fimi.ua.ac.be/data/retail.dat. Each record contains a set of product ids, which represents one transaction of a customer (a list of things that he or she bought together).





Customer history file: Each line of this file is a set of product ids, which defines one transaction that a customer has made. Product ids are separated by a space character. There may be empty lines between transactions; the implementation anticipates this case and skips them.
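The file format described above can be read with a few lines of Python. This is only a sketch: the function name `read_transactions` is hypothetical, and it assumes that product ids are integers, as they appear in the application output.

```python
import io


def read_transactions(lines):
    """Parse an iterable of text lines into a list of product-id sets.

    Each nonblank line holds product ids separated by whitespace;
    blank lines are skipped.
    """
    transactions = []
    for line in lines:
        ids = line.split()          # splits on any run of whitespace
        if ids:                     # an empty list means a blank line
            transactions.append(frozenset(int(x) for x in ids))
    return transactions


# In-memory stand-in for a transaction file, including one empty line.
sample = io.StringIO("39 48\n\n38 39 41\n")
print(read_transactions(sample))
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the in-memory sample.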

Min support and min confidence can be any double value between 0 and 100.

The association rules are stored in a file as the output. Each line is a rule of the form A => {B}.

III.3. Implementation

III.3.1. Packages

Figure 5 Packages

The program is divided into three main packages:
- GUI package: contains all classes that are responsible for providing the user interface.
- Algorithm package: contains all classes that implement the algorithms used in this program.
- Program package: contains the class that holds the main method. The program is run from here.

In addition, two packages store the input and output files of the system. The resource package stores the user transaction history file. The output package stores the output files, which contain all association rules generated by the system.

III.3.2. Class diagram



Figure 6 Class diagram

The classes ItemSet, FrequentItemSet, Candidate and Rule are simple POJOs used to store data. All procedures of the system, including the algorithms, are implemented in the Service class.

We explain the methods implemented in this class below, following each step of the system.

- Step 1: Find all frequent item sets
  o In this step, we implement all the methods that apply the Apriori algorithm to find the frequent item sets.

Method and description:

getConfigValue(String path)
  - Accepts the file path as input parameter.
  - Reads through the file and calculates the number of transactions and the number of items in the dataset.

createFirstCandidate()
  - Builds the first candidate list by collecting statistics over the file.
  - Returns the set C1.

getFirstFrequentItemSet()
  - From the first candidate list, finds all candidates whose support is greater than min_sup.
  - The support value is calculated by the function caculateSup().
  - Returns the first frequent item set L1.

getListCandidateFromPreviousFrequentItemSet()
  - Generates the candidate set Ck+1 from Lk.
  - To improve the performance of the algorithm, applies the prune step based on the downward closure property, that is, subsets of a frequent itemset are also frequent itemsets.
  - The prune step is implemented in the method prune().

apriori()
  - Implements the Apriori algorithm, using the pseudo code introduced in the basic concepts.

- Step 2: Generate strong association rules from the frequent itemsets.
  o This step is implemented in the ruleGenerate() method. This method traverses all frequent item sets found in step 1, generates all possible rules, calculates the confidence of each rule and keeps the strong rules based on the threshold (min confidence).
  o Within this method, we need a way to enumerate every possible rule from each frequent item set. We apply the binary string algorithm to solve that problem.
  o The algorithm is as follows:

List 3 Binary string algorithm

Input: one set taken from the frequent item sets
Output: the set of possible rules
Process:
    n = number of items in the input set (this determines the number of possible rules)
    a = n-bit binary number, initially all 0s
    While (number of 1 bits in a < n)
        a++;
    For each value of a, the items at bit 0 form the left-hand side of the rule and the items at bit 1 form the right-hand side.
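The binary string idea can be sketched in Python as follows. The function name `split_rules` is hypothetical, and the sketch assumes that the all-0s and all-1s patterns are skipped, since they would leave one side of the rule empty; the report does not state how its implementation handles these two patterns.

```python
def split_rules(items):
    """Enumerate every (left, right) split of `items` using an n-bit counter."""
    items = list(items)
    n = len(items)
    rules = []
    # Values 1 .. 2**n - 2 skip all-0s (empty right side) and all-1s (empty left side).
    for a in range(1, 2 ** n - 1):
        left = [x for i, x in enumerate(items) if not (a >> i) & 1]   # bit 0
        right = [x for i, x in enumerate(items) if (a >> i) & 1]      # bit 1
        rules.append((left, right))
    return rules


for left, right in split_rules([39, 41, 48]):
    print(left, "=>", right)
```

For an item set with n items this enumerates 2^n - 2 candidate rules, each of which would then be checked against min confidence.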





IV. Summary

IV.1. Achievements

After completing the implementation, we can state that our system works correctly according to the theory. Here are some screenshots and output of our application:

Figure 7 Main screen of the application

Figure 8 Result panel





List 4 Application output

    FILE STATISTIC
    Number of transaction: 88162
    Number of item: 16470
    ------------------------------------------------
    FREQUENT ITEM SET
    Frequent 1-itemset
    [15167 {32}]
    [15596 {38}]
    [50675 {39}]
    [14945 {41}]
    [42135 {48}]
    Frequent 2-itemset
    [29142 {39 48}]
    [10345 {38 39}]
    [9018 {41 48}]
    [11414 {39 41}]
    ------------------------------------------------
    RULES
    39 =====> 48 confident: 57.50764676862358
    48 =====> 39 confident: 69.16340334638662
    38 =====> 39 confident: 66.3311105411644
    41 =====> 48 confident: 60.3412512546002
    41 =====> 39 confident: 76.37336901973904
    ------------------------------------------------
    Execution time is: 0.969 seconds.

Notable strengths of our system:
- We built the system from scratch.
- The system performs well: as List 4 Application output shows, it processed 88162 transactions in 0.969 seconds.

IV.2. Future Work

Being restricted by time, we know that our work is not a complete solution. There are several tasks that we think should be added to the system if we have a chance to continue developing it:
- An explanation system.
- Integrating the system into a real market system with a real database, so that the system is trained and updated often.
- Improving the efficiency of the system by implementing several extensions.

