Fundamentals of Data Mining
Association rules
Fernando Berzal [email protected]
Intelligent Databases and Information Systems research group
Department of Computer Science and Artificial Intelligence
E.T.S. Ingeniería Informática, Universidad de Granada (Spain)

Motivation
Association mining searches for interesting relationships among items in a given data set.
EXAMPLES
  Diapers and six-packs are bought together, especially on Thursday evening (a myth?)
  A sequence such as buying first a digital camera and then a memory card is a frequent (sequential) pattern
  …
Outline: Motivation, Definition, Discovery, Variations, Visualization, Extensions, Applications (ART, ATBAR)

Motivation
MARKET BASKET ANALYSIS
The earliest form of association rule mining
Applications: Catalog design, store layout, cross-marketing…

Definition
Item
  In transactional databases: Any of the items included in a transaction.
  In relational databases: (Attribute, value) pair
k-itemset
  Set of k items
Itemset support
  support(I) = P(I)

Definition
Association rule
  X ⇒ Y
Support
  support(X ⇒ Y) = support(X ∪ Y) = P(X ∪ Y)
Confidence
  confidence(X ⇒ Y) = support(X ∪ Y) / support(X) = P(Y|X)
NOTE: Both support and confidence are relative measures (fractions of the data set, not absolute counts).
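
The support and confidence definitions can be tried out on a handful of transactions. The following is a minimal sketch, assuming transactions are stored as Python frozensets; the function names and the toy data are illustrative and do not come from the slides.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """support(X ∪ Y) / support(X), an estimate of P(Y|X)."""
    both = frozenset(x) | frozenset(y)
    return support(both, transactions) / support(x, transactions)


# Hypothetical toy data set (not taken from the slides).
transactions = [
    frozenset({"diapers", "beer", "milk"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "milk"}),
    frozenset({"milk", "bread"}),
]

supp_rule = support({"diapers", "beer"}, transactions)       # 2 of 4 transactions
conf_rule = confidence({"diapers"}, {"beer"}, transactions)  # 2 of the 3 with diapers
```

Both values are fractions in [0, 1], which is what the note about relative measures refers to.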

Discovery
Association rule mining
1. Find all frequent itemsets
2. Generate strong association rules from the frequent itemsets
Strong association rules are those that satisfy both a minimum support threshold and a minimum confidence threshold.
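
Step 2 can be sketched once the frequent itemsets and their supports are available. The code below is an illustrative sketch, not the author's implementation: it assumes the frequent itemsets arrive as a dictionary from frozenset to relative support, and the name `strong_rules` and the sample supports are hypothetical.

```python
from itertools import combinations


def strong_rules(frequent, min_conf):
    """Generate rules X => Y with confidence >= min_conf.

    `frequent` maps each frequent itemset (a frozenset) to its relative
    support and is assumed to contain every subset of every itemset in it.
    Itemsets in `frequent` already satisfy the minimum support threshold.
    """
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                x = frozenset(antecedent)
                conf = supp / frequent[x]  # support(X ∪ Y) / support(X)
                if conf >= min_conf:
                    rules.append((x, itemset - x, supp, conf))
    return rules


# Hypothetical supports for a toy data set.
frequent = {
    frozenset({"diapers"}): 0.75,
    frozenset({"beer"}): 0.5,
    frozenset({"diapers", "beer"}): 0.5,
}
rules = strong_rules(frequent, min_conf=0.7)
```

With these numbers only beer ⇒ diapers passes the confidence threshold; diapers ⇒ beer has confidence 2/3 and is discarded.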

Discovery
Apriori
Observation: All non-empty subsets of a frequent itemset must also be frequent
Algorithm: Frequent k-itemsets are used to explore potentially frequent (k+1)-itemsets (i.e. candidates)
Agrawal & Srikant: "Fast Algorithms for Mining Association Rules", VLDB'94
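
The level-wise idea can be illustrated with a short, unoptimized sketch. It follows the join, prune and count steps in their simplest form and is not the tuned algorithm from the paper; the function name `apriori`, its parameters and the toy transactions are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: relative support} for all frequent itemsets."""
    n = len(transactions)

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c / n for s, c in counts.items() if c / n >= min_support}

    frequent = dict(current)
    k = 1
    while current:
        prev = set(current)
        # Join: unions of two frequent k-itemsets that yield k+1 items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k))
        }
        # Count: one pass over the data for the surviving candidates.
        current = {}
        for c in candidates:
            supp = sum(1 for t in transactions if c <= t) / n
            if supp >= min_support:
                current[c] = supp
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy data set (not taken from the slides).
transactions = [
    frozenset({"diapers", "beer", "milk"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "milk"}),
    frozenset({"milk", "bread"}),
]
frequent = apriori(transactions, min_support=0.5)
```

On this data the candidate {diapers, beer, milk} is pruned without counting, because its subset {beer, milk} is not frequent.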

Discovery
Apriori improvements (I)
Reducing the number of candidates
  Park, Chen & Yu: "An Effective Hash-Based Algorithm for Mining Association Rules", SIGMOD'95
Sampling
  Toivonen: "Sampling Large Databases for Association Rules", VLDB'96
  Park, Yu & Chen: "Mining Association Rules with Adjustable Accuracy", CIKM'97
Partitioning
  Savasere, Omiecinski & Navathe: "An Efficient Algorithm for Mining Association Rules in Large Databases", VLDB'95

Discovery
Apriori improvements (II)
Transaction reduction
  Agrawal & Srikant: "Fast Algorithms for Mining Association Rules", VLDB'94 (AprioriTID)
Dynamic itemset counting
  Brin, Motwani, Ullman & Tsur: "Dynamic Itemset Counting and Implication Rules for Market Basket Data", SIGMOD'97 (DIC)
  Hidber: "Online Association Rule Mining", SIGMOD'99 (CARMA)

Discovery
Apriori-like algorithm: TBAR (Tree-based association rule mining)
Berzal, Cubero, Sánchez & Serrano: "TBAR: An efficient method for association rule mining in relational databases", Data & Knowledge Engineering, 2001

Discovery: TBAR
[Figure: TBAR itemset tree with support counts]
L1: A #7, B #9, C #7, D #8
L2: B #6, D #5, C #6, D #7, D #5
L3: D #5, D #5
Annotations: 7 instances with A; 6 instances with AB; 5 instances with AD; 6 instances with BC; 5 instances with ABD

Discovery
An alternative to Apriori: Compress the database representing frequent items into a frequent-pattern tree (FP-tree)…
Han, Pei & Yin: "Mining Frequent Patterns without Candidate Generation", SIGMOD'2000
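
To make the compression idea concrete, here is a minimal sketch of FP-tree construction only; the mining phase (FP-growth) and the header links between nodes of the same item are left out. The class and function names and the toy data are assumptions for this example, not code from the paper.

```python
from collections import Counter


class FPNode:
    """One node of the tree: an item plus the number of transactions
    whose sorted prefix passes through this node."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First scan: count items and keep only the frequent ones.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {item: c for item, c in freq.items() if c >= min_count}

    # Second scan: insert each transaction with its frequent items sorted
    # by descending frequency (ties broken alphabetically), so that
    # transactions sharing frequent items share a path from the root.
    root = FPNode(None)
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, freq


# Hypothetical toy data set (not taken from the slides).
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a"], ["d"]]
root, freq = build_fp_tree(transactions, min_count=2)
```

Here the four transactions containing "a" share a single root branch, and the infrequent item "d" never enters the tree.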

Discovery
A challenge: When an itemset is frequent, all its subsets are also frequent (a single frequent k-itemset implies 2^k − 1 frequent non-empty itemsets).
Closed itemset C: There exists no proper super-itemset S such that support(S) = support(C)
Maximal (frequent) itemset M: M is frequent and there exists no super-itemset Y such that M ⊂ Y and Y is frequent.
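
The two definitions can be checked directly on a small collection of frequent itemsets. This is an illustrative sketch under the assumption that all frequent itemsets and their absolute support counts are already known; the function name and the counts are hypothetical.

```python
def closed_and_maximal(frequent):
    """Split frequent itemsets into closed and maximal ones.

    `frequent` maps every frequent itemset (frozenset) to its support count.
    Only frequent supersets need to be inspected: a superset with the same
    support as a frequent itemset is necessarily frequent too.
    """
    closed, maximal = set(), set()
    for itemset, count in frequent.items():
        supersets = [s for s in frequent if itemset < s]  # proper supersets
        if not supersets:
            maximal.add(itemset)
        if all(frequent[s] != count for s in supersets):
            closed.add(itemset)
    return closed, maximal


# Hypothetical support counts (not taken from the slides).
frequent = {
    frozenset("a"): 4,
    frozenset("b"): 2,
    frozenset("c"): 2,
    frozenset("ab"): 2,
    frozenset("ac"): 2,
}
closed, maximal = closed_and_maximal(frequent)
```

In this example {b} is not closed because {a, b} has the same support, and every maximal itemset is also closed.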

Variations
Based on the kinds of patterns to be mined:
  Frequent itemset mining (transactional and relational data)
  Sequential pattern mining (sequence data sets, e.g. bioinformatics)
  Structured pattern mining (structured data, e.g. graphs)

Variations
Based on the types of values handled:
  Boolean association rules
  Quantitative association rules
  Fuzzy association rules
    Delgado, Marín, Sánchez & Vila: "Fuzzy association rules: General model and applications", IEEE Transactions on Fuzzy Systems, 2003

Variations
More options:
  Generalized association rules (a.k.a. multilevel association rules)
  Constraint-based association rule mining
  Incremental algorithms
  Top-k algorithms
  …
ICDM FIMI: Workshop on Frequent Itemset Mining Implementations
http://fimi.cs.helsinki.fi/

Visualization
Integrated into data mining tools to help users understand data mining results:
  Table-based approach, e.g. SAS Enterprise Miner, DBMiner…
  2D Matrix-based approach, e.g. SGI MineSet, DBMiner…
  Graph-based techniques, e.g. DBMiner ball graphs

Visualization: Tables

Visualization: Visual aids

Visualization: 2D Matrix

Visualization: Graphs

Visualization: VisAR
Based on parallel coordinates (Techapichetvanich & Datta, ADMA'2005)

Extensions
Confidence is not the best possible interestingness measure for rules
e.g. A very frequent item will always appear in rule consequents, regardless of its true relationship with the rule antecedent
  X went to war ⇒ X did not serve in Vietnam (from the US Census)
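
A small numeric sketch shows how a rule with a very frequent consequent can reach high confidence even when antecedent and consequent are negatively correlated. The supports below are made-up values chosen for illustration, not figures from the US Census, and lift (confidence divided by the support of the consequent) is used here only as one common alternative measure.

```python
def confidence(supp_xy: float, supp_x: float) -> float:
    """P(Y|X) estimated as support(X ∪ Y) / support(X)."""
    return supp_xy / supp_x


def lift(supp_xy: float, supp_x: float, supp_y: float) -> float:
    """Ratio between the observed joint support and the support expected
    if X and Y were independent; 1 means independence."""
    return supp_xy / (supp_x * supp_y)


# Hypothetical supports: Y is a very frequent consequent.
supp_x, supp_y, supp_xy = 0.10, 0.95, 0.09

rule_confidence = confidence(supp_xy, supp_x)  # about 0.90, looks strong
rule_lift = lift(supp_xy, supp_x, supp_y)      # below 1: X makes Y less likely
```

The rule would pass a 0.9 confidence threshold, yet knowing X lowers the probability of Y from 0.95 to about 0.90.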

Extensions
Desirable properties for interestingness measures (Piatetsky-Shapiro, 1991)
P1 ACC(A⇒C) = 0 when supp(A⇒C) = supp(A)supp(C)
P2 ACC(A⇒C) monotonically increases with supp(A⇒C) when the other parameters remain the same
P3 ACC(A⇒C) monotonically decreases with supp(A) (or supp(C)) when the other parameters remain the same
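
As a quick sanity check, the three properties can be verified numerically for one simple candidate measure, the difference supp(A⇒C) − supp(A)supp(C). This measure is chosen here only as an example of something that plays the role of ACC; the slide does not name it, and the function name and the sample supports are assumptions.

```python
def leverage(supp_ac: float, supp_a: float, supp_c: float) -> float:
    """supp(A => C) - supp(A) * supp(C): zero under independence."""
    return supp_ac - supp_a * supp_c


# P1: zero when A and C are statistically independent.
p1_value = leverage(0.2, 0.4, 0.5)

# P2: grows with supp(A => C) when supp(A) and supp(C) are fixed.
p2_holds = leverage(0.3, 0.4, 0.5) > leverage(0.2, 0.4, 0.5)

# P3: shrinks when supp(A) (or supp(C)) grows and the rest is fixed.
p3_holds_a = leverage(0.2, 0.5, 0.5) < leverage(0.2, 0.4, 0.5)
p3_holds_c = leverage(0.2, 0.4, 0.6) < leverage(0.2, 0.4, 0.5)
```

Confidence, by contrast, fails P1: under independence it equals supp(C) rather than 0.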

Extensions
Certainty factors…
  … satisfy Piatetsky-Shapiro's properties
  … are widely used in expert systems
  … are not symmetric (unlike interest/lift)
  … can substitute conviction when CF>0
Berzal, Blanco, Sánchez & Vila: "Measuring the accuracy and interest of association rules: A new framework", Intelligent Data Analysis, 2002
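
The slide does not spell out the formula, so the sketch below assumes the usual certainty-factor definition: the change from supp(C) to conf(A⇒C), normalized by 1 − supp(C) when confidence is higher and by supp(C) when it is lower. It also computes conviction as (1 − supp(C)) / (1 − conf) to illustrate the relation CF = 1 − 1/conviction for positive CF. Function names and numbers are illustrative.

```python
def certainty_factor(supp_ac: float, supp_a: float, supp_c: float) -> float:
    """Certainty factor of A => C under the assumed definition."""
    conf = supp_ac / supp_a
    if conf > supp_c:
        return (conf - supp_c) / (1 - supp_c)
    if conf < supp_c:
        return (conf - supp_c) / supp_c
    return 0.0


def conviction(supp_ac: float, supp_a: float, supp_c: float) -> float:
    """(1 - supp(C)) / (1 - conf(A => C)); infinite for exact rules."""
    conf = supp_ac / supp_a
    if conf == 1:
        return float("inf")
    return (1 - supp_c) / (1 - conf)


# Hypothetical supports: conf = 0.75, supp(C) = 0.5.
cf_positive = certainty_factor(0.3, 0.4, 0.5)
conv = conviction(0.3, 0.4, 0.5)

# A rule whose consequent is less likely given the antecedent.
cf_negative = certainty_factor(0.09, 0.10, 0.95)

# Independence: conf equals supp(C).
cf_independent = certainty_factor(0.2, 0.4, 0.5)
```

Unlike conviction, which ranges over [0, ∞), the certainty factor stays within [−1, 1], which makes thresholds easier to interpret.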

Extensions
References:
  Hilderman & Hamilton: "Evaluation of interestingness measures for ranking discovered knowledge". PAKDD, 2001
  Tan, Kumar & Srivastava: "Selecting the right objective measure for association analysis". Information Systems, vol. 29, pp. 293-313, 2004.
  Berzal, Cubero, Marín, Sánchez, Serrano & Vila: "Association rule evaluation for classification purposes". TAMIDA'2005

Applications
Two sample applications where association rules have been successful
Classification (ART)
  Berzal, Cubero, Sánchez & Serrano: "ART: A hybrid classification model", Machine Learning Journal, 2004
Anomaly detection (ATBAR)
  Balderas, Berzal, Cubero, Eisman & Marín: "Discovering Hidden Association Rules", KDD'2005, Chicago, Illinois, USA

Classification
Classification models based on association rules
  Partial classification models, e.g. Bayardo
  "Associative" classification models, e.g. CBA (Liu et al.)
  Bayesian classifiers, e.g. LB (Meretakis et al.)
  Emerging patterns, e.g. CAEP (Dong et al.)
  Rule trees, e.g. Wang et al.
  Rules with exceptions, e.g. Liu et al.

Classification
GOAL
  Simple, intelligible, and robust classification models obtained in an efficient and scalable way
MEANS
  Decision Tree Induction + Association Rule Mining = ART [Association Rule Trees]

ART Classification Model
IDEA
  Make use of efficient association rule mining algorithms to build a decision-tree-shaped classification model.
  ART = Association Rule Tree
KEY
  Association rules + "else" branches
  Hybrid between decision trees and decision lists
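
The "rules plus else branch" shape can be pictured with a small data structure. This sketch only shows how a finished model of that shape might classify a record; it says nothing about how ART induces the tree, and the class `ARTNode`, its fields and the example model are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ARTNode:
    # Each rule pairs an antecedent ({attribute: value}) with a class label.
    rules: List[Tuple[Dict[str, str], str]] = field(default_factory=list)
    # Records not covered by any rule follow the "else" branch, if present.
    else_branch: Optional["ARTNode"] = None
    # Label used when no rule matches and there is no "else" branch.
    default_class: Optional[str] = None


def classify(node: ARTNode, record: Dict[str, str]) -> Optional[str]:
    """Walk the model like a decision list nested inside a tree."""
    while True:
        for antecedent, label in node.rules:
            if all(record.get(attr) == value
                   for attr, value in antecedent.items()):
                return label
        if node.else_branch is None:
            return node.default_class
        node = node.else_branch


# Hypothetical model over binary attributes X, Y, Z.
model = ARTNode(
    rules=[({"X": "0", "Y": "0"}, "0")],
    else_branch=ARTNode(rules=[({"Z": "0"}, "0")], default_class="1"),
)

first = classify(model, {"X": "0", "Y": "0", "Z": "1"})   # matched at the root
second = classify(model, {"X": "1", "Y": "0", "Z": "0"})  # matched in the else branch
third = classify(model, {"X": "1", "Y": "1", "Z": "1"})   # falls to the default class
```

Rules with several attributes in the antecedent are what allow such a model to capture attribute interactions in a single node.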

ART Classification Model
[Figure: SPLICE]

ART classification model
Example: ART vs. TDIDT
[Figure: an ART tree, including an "else" branch, next to a TDIDT tree, both over the binary attributes X, Y and Z]

ART classification model
Final comments
Classification models
  Acceptable accuracy
  Reduced complexity
  Attribute interactions
  Robustness (noise & primary keys)
Classifier building method
  Efficient algorithm
  Good scalability properties
  Automatic parameter selection

Anomaly detection
It is often more interesting to find surprising non-frequent events than frequent ones
EXAMPLES
  Abnormal network activity patterns in intrusion detection systems.
  Exceptions to "common" rules in Medicine (useful for diagnosis, drug evaluation, detection of conflicting therapies…)
  …

Anomaly detection
Anomalous association rule
Confident rule representing homogeneous deviations from common behavior.

Anomaly detection
Anomalous association rule
  X ⇒ Y frequent and confident
  X¬Y ⇒ A confident
  XY ⇒ ¬A confident
X usually implies Y (dominant rule)
When X does not imply Y, then it usually implies A (the Anomaly)
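
The three conditions can be tested directly on a data set. The following is a naive sketch that scans the transactions for one given triple (X, Y, A); it is not the ATBAR algorithm, Y and A are treated as single items for simplicity, and the function name, thresholds and toy data are assumptions.

```python
def is_anomalous(transactions, x, y, a, min_supp, min_conf):
    """Check whether X leads to A as an anomaly with respect to X => Y.

    x is an itemset (frozenset); y and a are single items.
    """
    n = len(transactions)
    with_x = [t for t in transactions if x <= t]
    with_xy = [t for t in with_x if y in t]
    with_x_not_y = [t for t in with_x if y not in t]
    if not with_xy or not with_x_not_y:
        return False

    # X => Y is frequent and confident (the dominant rule).
    dominant = (len(with_xy) / n >= min_supp
                and len(with_xy) / len(with_x) >= min_conf)
    # X¬Y => A is confident: when Y fails, A usually shows up.
    a_when_not_y = sum(1 for t in with_x_not_y if a in t) / len(with_x_not_y)
    # XY => ¬A is confident: when Y holds, A is usually absent.
    not_a_when_y = sum(1 for t in with_xy if a not in t) / len(with_xy)

    return dominant and a_when_not_y >= min_conf and not_a_when_y >= min_conf


# Hypothetical data: X usually comes with Y; the exceptions all contain A.
transactions = (
    [frozenset({"x", "y"})] * 8
    + [frozenset({"x", "a"})] * 2
)
found = is_anomalous(transactions, frozenset({"x"}), "y", "a",
                     min_supp=0.5, min_conf=0.75)
not_found = is_anomalous(transactions, frozenset({"x"}), "y", "z",
                         min_supp=0.5, min_conf=0.75)
```

Note that the anomaly itself is rare (2 of 10 transactions here), so a plain minimum support threshold applied to X ⇒ A would have discarded it.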

Anomaly detection
X Y A1 Z1…
X Y A1 Z2…
X Y A2 Z3…
X Y A2 Z1…
X Y A3 Z2…
X Y A3 Z3…
X Y A Z …
X Y3 A Z3
…
X Y3 A Z …
X Y4 A Z …
X ⇒ Y is the dominant rule
X ⇒ A when ¬Y is the anomalous rule

Anomaly detection
Suzuki et al.'s "Exception Rules"
  X ⇒ Y is an association rule
  X I ⇒ ¬Y is the exception rule
  X ⇒ I is the reference rule
  I is the "interacting" itemset
Too many exceptions
The "cause" needs to be present

Anomaly detection: ATBAR
Anomalous association rules
[Figure: ATBAR itemset tree]
First scan: A #7, B #9, C #7, D #8
Second scan: B #6, D #5
A #7 with A*: AB#6 AC#4 AD#5 AE#3 AF#3 (label: Non-frequent)

Anomaly detection: ATBAR
Anomalous association rules
[Figure: ATBAR itemset tree]
First scan: A #7 with A*, B #9 with B*, C #7 with C*, D #8 with D*
Second scan: B #6, D #5, C #6, D #7, D #5

Anomaly detection: ATBAR
Anomalous association rules
Rule generation is immediate from the frequent and extended itemsets obtained by ATBAR

Anomaly detection: Results
Experiments on health-related datasets from the UCI Machine Learning Repository
Relatively small set of anomalous rules (typically, >90% reduction with respect to standard association rules)
Reasonable overhead needed to obtain anomalous association rules (about 20% in ATBAR w.r.t. TBAR)

Anomaly detection: Results
An example from the Census dataset:
  if WORKCLASS: Local-gov
  then CAPGAIN: [99999.0 , 99999.0] (7 out of 7)   ("Anomaly")
  when not CAPGAIN: [0.0 , 20051.0]   (Usual consequent)

Anomaly detection: Results
Anomalous association rules (novel characterization of potentially interesting knowledge)
An efficient algorithm for discovering anomalous association rules: ATBAR
Some heuristics for filtering the discovered anomalous association rules