Enterprise Warehousing and Information Systems The Apriori Algorithm

Upload: beatrice-firan

Post on 17-Aug-2015

Student: Ionela-Beatrice Firan

The Apriori Algorithm

The information revolution is generating mountains of data from sources as diverse as business and science fields. One of the greatest challenges is how to turn this rapidly expanding data into accessible and actionable knowledge.

Data mining is the automated discovery of non-trivial, implicit, previously unknown, and potentially useful information or patterns embedded in databases. Briefly stated, it refers to extracting or mining knowledge from large amounts of data.

The motivation for data mining is a suspicion that there might be nuggets of useful information hiding in the masses of unanalyzed or underanalyzed data, and therefore methods for locating interesting information in data would be useful. From the beginning, data mining research has been driven by its applications. While finance and industry have long recognized the benefits of data mining, data mining techniques can be effectively applied in many areas and can be performed on a variety of data stores, including relational databases, transaction databases and data warehouses.

Generally speaking, there are two classes of data mining: descriptive and predictive. Descriptive mining summarizes or characterizes general properties of the data in data repositories, while predictive mining performs inference on current data in order to make predictions based on the historical data.

One of the fundamental methods from the prospering field of data mining is the generation of association rules that describe relationships between items in data sets. The original motivation for searching for association rules came from the need to analyze so-called supermarket transaction data, that is, to explore customer behavior in terms of purchased products. Association rules describe how often items are purchased together.

The first algorithm for mining association rules, the Apriori algorithm, was introduced in 1994.
The motivation behind introducing the Apriori algorithm was the progress that was made at that time in bar-code technology, which enabled retail supermarkets to store large quantities of sales data in their databases. The collected data was referred to as market basket data, or just basket data.

Basically, an association rule is an implication X ⇒ Y where X and Y are disjoint sets of items. The meaning of such rules is quite intuitive: let D be a transaction database, where each transaction T ∈ D is a set of items. An association rule X ⇒ Y then expresses "Whenever a transaction T contains X, then this transaction T also contains Y with probability conf". The probability conf is called the rule confidence and is supplemented by further quality measures like rule support and interest. The support sup is simply the number of transactions that contain all items in the antecedent and consequent parts of the rule. (The support is sometimes expressed as a percentage of the total number of records in the database.) The confidence conf is the ratio of the number of transactions that contain all items in the consequent as well as the antecedent to the number of transactions that contain all items in the antecedent.

The Apriori algorithm discovers association rules in data. For example, "if a customer purchases a razor and after shave, then that customer will purchase shaving cream with [...]"

1. Find all combinations of items, called frequent itemsets, whose support is greater than the minimum support.
2. Use the frequent itemsets to generate the desired rules. The idea is that if, for example, ABC and BC are frequent, then the rule "BC implies A" holds if the ratio of support(ABC) to support(BC) is at least as large as the minimum confidence. Note that the rule will have minimum support because ABC is frequent. ODM Association only supports single consequent rules (ABC implies D).

The number of frequent itemsets is governed by the minimum support parameter. The number of rules generated is governed by the number of frequent itemsets and the confidence parameter.
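The support and confidence definitions above, together with the rule-generation step, can be sketched in Python. This is only a minimal illustration, not the ODM or Weka implementation: the basket data and the function names support, confidence and single_consequent_rules are hypothetical, and the confidence thresholds are chosen purely for the demonstration.

```python
# Hypothetical toy basket data, invented for illustration only.
transactions = [
    {"razor", "after shave", "shaving cream"},
    {"razor", "after shave"},
    {"razor", "shaving cream"},
    {"after shave", "shaving cream"},
    {"razor", "after shave", "shaving cream"},
]


def support(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(set(antecedent) | set(consequent), transactions)
    ante = support(antecedent, transactions)
    return both / ante if ante else 0.0


def single_consequent_rules(itemset, transactions, min_conf):
    """Rules of the form (itemset minus one item) => that item.

    A rule is kept when its confidence is at least `min_conf`.
    """
    rules = []
    for item in sorted(itemset):
        antecedent = frozenset(itemset) - {item}
        if not antecedent:
            continue  # a rule needs a non-empty antecedent
        conf = confidence(antecedent, {item}, transactions)
        if conf >= min_conf:
            rules.append((antecedent, item, conf))
    return rules


full = {"razor", "after shave", "shaving cream"}
for antecedent, item, conf in single_consequent_rules(full, transactions, 0.6):
    print(sorted(antecedent), "=>", item, round(conf, 2))
```

With this toy data every single-consequent rule derived from the three-item set has confidence 2/3, so a threshold of 0.6 keeps all three rules while a threshold of 0.7 keeps none.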
If the confidence parameter is set too high, there may be frequent itemsets in the association model but no rules.

The most common way to store data collected in various areas is in relational databases. The development of Information and Communication Technology has led to a huge volume of stored data and to the inability to extract useful information and knowledge from this data by using traditional methods. For this reason, data mining has developed as a specific field. Mining association rules is one of the commonly used methods in data mining. Association rules model dependencies between items in transactional data. Most data mining systems work with data stored in flat files. However, it has been shown that it is beneficial to implement data mining algorithms within a DBMS, and using SQL to discover patterns in data can bring certain advantages.

Since the generation of frequent itemsets is the most expensive part in terms of resources and time, a lot of algorithms for this task have been developed. Most algorithms use a method that builds candidate itemsets, which are sets of potentially frequent itemsets, and then tests them.

Support for these candidates is determined by taking into account the whole database D. The process of generating candidate itemsets uses the information about the frequency of all candidates already checked. The procedure is the following: the downward-closure property of frequent itemsets states that all subsets of a frequent itemset are also frequent. This allows us to remove from the candidate itemsets those sets that contain at least one subset of items that is not frequent.

After the candidates are generated, the occurrences of each candidate in the database are counted, in order to retain only those whose support is greater than minsup.

Then we can move to the next iteration. The whole process ends when there are no more potentially frequent itemsets.

The best-known algorithm that uses the above mechanism is Apriori.
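The candidate-generation and pruning mechanism described above can be sketched as a level-wise loop. The following Python is a minimal sketch under stated assumptions, not the original Apriori code or Weka's implementation: the function name apriori_frequent_itemsets and the toy transactions are hypothetical, and minsup is treated as an inclusive absolute count (the text says "greater than", so adjust the comparison if a strict threshold is wanted).

```python
from itertools import combinations

# Hypothetical toy transactions, invented for illustration only.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
    {"B", "D"},
]


def apriori_frequent_itemsets(transactions, minsup):
    """Return {itemset: support count} for all itemsets with count >= minsup."""
    db = [frozenset(t) for t in transactions]

    # Level 1: count single items in one pass over the database.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= minsup}
    frequent = dict(current)

    k = 2
    while current:
        prev = list(current)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates with any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Counting step: one pass over the whole database per level.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= minsup}
        frequent.update(current)
        k += 1  # the loop ends when a level yields no frequent itemsets
    return frequent


result = apriori_frequent_itemsets(transactions, minsup=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Because item D occurs only once, it is discarded at the first level, so no candidate containing D is ever generated or counted at later levels. This is the saving that the downward-closure property provides.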
"n this basis some variants such as Apriori Tid% Apriori All%Apriori Some or Apriori Cibrid#ere developed!To see ho# Apriorialghorthim #or&s #e #illuse We&a 8Wai&atoEnvironment for Eno#ledge Analysis9% a popular suite of machinelearning soft#are #ritten in Fava!We&aisa#or&benchthatcontainsacollectionof visuali)ationtoolsandalgorithmsfordataanalysisandpredictivemodeling%together #ithgraphical user interfacesfor easyaccesstothisfunctionality!I too& a sample nominal dataset! It outputs -< rules% ran&ed according to the condence measure given in parentheses after each one! The number follo#ing a ruleGs antecedent sho#s ho# many instances satisfy the antecedentH the number follo#ing the conclusion sho#s ho# many instances satisfy the entire rule 8this is the ruleGs 6support79! Because both numbers are e*ual for all -< rules% the condence of every rule is e$actly -!We&aGs Apriori runs the basic algorithm several times! It uses the same user-specied minimum condence value throughout! The support level is e$pressed as a proportion of the total number of instances 8-/ in this case9% as a ratio bet#een < and -! The minimum support level starts at a certain value 8default -!