fundamentos de minería de datos

Intelligent Databases and Information Systems research group, Department of Computer Science and Artificial Intelligence, E.T.S. Ingeniería Informática – Universidad de Granada (Spain). Fundamentos de Minería de Datos (Fundamentals of Data Mining): Reglas de asociación (Association rules). Fernando Berzal [email protected]


Post on 18-Jan-2016


Page 1: Fundamentos de Minería de Datos

Intelligent Databases and Information Systems research group

Department of Computer Science and Artificial Intelligence

E.T.S. Ingeniería Informática – Universidad de Granada (Spain)

Fundamentos de Minería de Datos (Fundamentals of Data Mining)

Reglas de asociación (Association rules)

Fernando Berzal [email protected]

Page 2: Fundamentos de Minería de Datos

Motivation

Association mining searches for interesting relationships among items in a given data set

EXAMPLES

Diapers and six-packs are bought together, especially on Thursday evenings (a myth?)

A sequence such as buying first a digital camera and then a memory card is a frequent (sequential) pattern

…

Outline: Motivation, Definition, Discovery, Variations, Visualization, Extensions, Applications (ART, ATBAR)

Page 3: Fundamentos de Minería de Datos

Motivation

MARKET BASKET ANALYSIS

The earliest form of association rule mining

Applications: catalog design, store layout, cross-marketing…

Page 4: Fundamentos de Minería de Datos

Definition

Item

In transactional databases: any of the items included in a transaction.

In relational databases: an (attribute, value) pair.

k-itemset

Set of k items

Itemset support

support(I) = P(I)
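A minimal Python sketch of the definitions above. The baskets, the `support` helper and the `row` example are illustrative assumptions, not part of the slides; support is taken as the fraction of transactions that contain every item of the itemset.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable, transactions: List[FrozenSet]) -> float:
    """Relative support: fraction of transactions that contain the whole itemset."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


# Hypothetical transactional data: each basket is a set of items.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
]

# Hypothetical relational data: a row becomes a set of (attribute, value) items.
row = {"outlook": "sunny", "windy": "no"}
row_items = frozenset(row.items())

print(support({"bread"}, transactions))            # 1-itemset
print(support({"diapers", "beer"}, transactions))  # 2-itemset
```

With this encoding the same `support` function works for both kinds of databases, since a relational row is just a transaction whose items are (attribute, value) pairs.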

Page 5: Fundamentos de Minería de Datos

Definition

Association rule

X ⇒ Y

Support

support(X⇒Y) = support(X∪Y) = P(X∪Y)

Confidence

confidence(X⇒Y) = support(X∪Y) / support(X) = P(Y|X)

NOTE: Both support and confidence are relative (fractions of the transactions, not absolute counts)
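The two formulas can be checked on a small example. This is only a sketch: the baskets and the `rule_metrics` helper are hypothetical, and an empty antecedent cover is mapped to confidence 0 by convention.

```python
def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent => consequent."""
    x = frozenset(antecedent)
    y = frozenset(consequent)
    n = len(transactions)
    n_x = sum(1 for t in transactions if x <= t)          # transactions with X
    n_xy = sum(1 for t in transactions if (x | y) <= t)   # transactions with X and Y
    supp = n_xy / n if n else 0.0
    conf = n_xy / n_x if n_x else 0.0
    return supp, conf


# Hypothetical baskets.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
]

supp, conf = rule_metrics({"diapers"}, {"beer"}, baskets)
print(f"support={supp:.2f} confidence={conf:.2f}")
```

Here three baskets contain diapers and two of those also contain beer, so the rule covers half of the database and holds in two out of three of the cases where its antecedent appears.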

Page 6: Fundamentos de Minería de Datos

Discovery

Association rule mining

1. Find all frequent itemsets

2. Generate strong association rules from the frequent itemsets

Strong association rules are those that satisfy both a minimum support threshold and a minimum confidence threshold.
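Step 2 can be sketched as follows, assuming step 1 has already produced a table with the relative support of every frequent itemset (and therefore of all their subsets). The `generate_rules` name and the support values are hypothetical.

```python
from itertools import combinations


def generate_rules(frequent, min_conf):
    """frequent: dict frozenset -> relative support, closed under subsets.

    Returns (antecedent, consequent, support, confidence) tuples for every
    rule whose confidence reaches min_conf. The minimum support threshold is
    already guaranteed because only frequent itemsets are split into rules.
    """
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                x = frozenset(antecedent)
                conf = supp / frequent[x]
                if conf >= min_conf:
                    rules.append((x, itemset - x, supp, conf))
    return rules


# Hypothetical output of step 1.
frequent = {
    frozenset({"A"}): 0.7,
    frozenset({"B"}): 0.9,
    frozenset({"A", "B"}): 0.6,
}

rules = generate_rules(frequent, min_conf=0.8)
for x, y, supp, conf in rules:
    print(sorted(x), "=>", sorted(y), f"supp={supp:.2f} conf={conf:.2f}")
```

Only A ⇒ B survives in this example: B ⇒ A shares the same support but its confidence (0.6 / 0.9) falls below the threshold.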

Page 7: Fundamentos de Minería de Datos

Discovery

Apriori

Observation:

All non-empty subsets of a frequent itemset must also be frequent

Algorithm:

Frequent k-itemsets are used to explore potentially frequent (k+1)-itemsets (i.e. candidates)

Agrawal & Srikant: "Fast Algorithms for Mining Association Rules", VLDB'94
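A compact, unoptimized Python sketch of the level-wise idea: candidates of size k+1 are built from frequent k-itemsets and pruned with the observation above before they are counted. The function name, the data and the threshold are illustrative assumptions; the original paper uses more efficient candidate generation and counting structures.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: absolute count} for every frequent itemset."""
    min_count = min_support * len(transactions)

    # L1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 1
    while current:
        # Join: unite pairs of frequent k-itemsets that differ in one item.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1:
                # Prune: every k-subset of a candidate must be frequent.
                if all(frozenset(sub) in current
                       for sub in combinations(union, k)):
                    candidates.add(union)
        # Count the surviving candidates with one pass over the data.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical database of five transactions.
db = [frozenset("ABD"), frozenset("ABD"), frozenset("AB"),
      frozenset("BC"), frozenset("CD")]
result = apriori(db, min_support=0.4)
for itemset, count in sorted(result.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print("".join(sorted(itemset)), count)
```

In this toy run AC is never a candidate for level 3 because it is not frequent, so ABC and ACD are discarded without being counted.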

Page 8: Fundamentos de Minería de Datos

Discovery

Apriori improvements (I)

Reducing the number of candidates

Park, Chen & Yu: "An Effective Hash-Based Algorithm for Mining Association Rules", SIGMOD'95

Sampling

Toivonen: "Sampling Large Databases for Association Rules", VLDB'96

Park, Yu & Chen: "Mining Association Rules with Adjustable Accuracy", CIKM'97

Partitioning

Savasere, Omiecinski & Navathe: "An Efficient Algorithm for Mining Association Rules in Large Databases", VLDB'95
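As an illustration of the partitioning idea only (not the published algorithm's data structures): an itemset that is frequent in the whole database must be frequent in at least one partition, so the locally frequent itemsets form a complete candidate set that a second scan can verify. The brute-force `local_frequent` miner, the function names and the data are assumptions made for this sketch.

```python
from itertools import combinations


def local_frequent(partition, min_support):
    """Brute-force miner for one small, in-memory partition."""
    counts = {}
    for t in partition:
        for k in range(1, len(t) + 1):
            for sub in combinations(sorted(t), k):
                key = frozenset(sub)
                counts[key] = counts.get(key, 0) + 1
    threshold = min_support * len(partition)
    return {s for s, c in counts.items() if c >= threshold}


def partition_mine(transactions, min_support, n_parts):
    """Two-phase mining: local candidates first, then one global count."""
    size = -(-len(transactions) // n_parts)  # ceiling division
    parts = [transactions[i:i + size]
             for i in range(0, len(transactions), size)]

    # Phase 1: union of the locally frequent itemsets.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_support)

    # Phase 2: a single scan of the whole database to verify candidates.
    result = {}
    for c in candidates:
        count = sum(1 for t in transactions if c <= t)
        if count >= min_support * len(transactions):
            result[c] = count
    return result


# Hypothetical database, split into two partitions.
db = [frozenset("ABD"), frozenset("ABD"), frozenset("AB"),
      frozenset("BC"), frozenset("CD")]
found = partition_mine(db, min_support=0.4, n_parts=2)
print(sorted("".join(sorted(s)) for s in found))
```

BC and CD are locally frequent in the second partition but fail the global check, which is why the second scan is needed.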

Page 9: Fundamentos de Minería de Datos

Discovery

Apriori improvements (II)

Transaction reduction

Agrawal & Srikant: "Fast Algorithms for Mining Association Rules", VLDB'94 (AprioriTID)

Dynamic itemset counting

Brin, Motwani, Ullman & Tsur: "Dynamic Itemset Counting and Implication Rules for Market Basket Data", SIGMOD'97 (DIC)

Hidber: "Online Association Rule Mining", SIGMOD'99 (CARMA)

Page 10: Fundamentos de Minería de Datos

Discovery

Apriori-like algorithm: TBAR (Tree-based association rule mining)

Berzal, Cubero, Sánchez & Serrano

"TBAR: An efficient method for association rule mining in relational databases"

Data & Knowledge Engineering, 2001

Page 11: Fundamentos de Minería de Datos

Discovery: TBAR

[Figure: TBAR itemset tree, one level per itemset size]

L1: A #7, B #9, C #7, D #8

L2: below A: B #6, D #5; below B: C #6, D #7; below C: D #5

L3: below AB: D #5; below BC: D #5

Callouts: 7 instances with A; 6 instances with AB; 5 instances with AD; 5 instances with ABD; 6 instances with BC
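The tree in the figure can be mimicked with a simple prefix tree in which the path from the root to a node spells an itemset and the node stores its support count. This is an illustrative sketch, not the authors' implementation: the class names are hypothetical, and the counts for BD, CD and BCD are read from the tree layout rather than from an explicit callout.

```python
class ItemsetNode:
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


class ItemsetTree:
    """Prefix tree of itemsets; items on a path are kept in sorted order."""

    def __init__(self):
        self.root = ItemsetNode()

    def insert(self, itemset, count):
        node = self.root
        for item in sorted(itemset):
            node = node.children.setdefault(item, ItemsetNode(item))
        node.count = count

    def support_count(self, itemset):
        node = self.root
        for item in sorted(itemset):
            node = node.children.get(item)
            if node is None:
                return 0  # not stored, i.e. not frequent
        return node.count

    def level(self, k):
        """All stored k-itemsets as (items, count) pairs."""
        found = []

        def walk(node, prefix):
            if len(prefix) == k:
                found.append((prefix, node.count))
                return
            for item in sorted(node.children):
                walk(node.children[item], prefix + (item,))

        walk(self.root, ())
        return found


tree = ItemsetTree()
for itemset, count in [("A", 7), ("B", 9), ("C", 7), ("D", 8),
                       ("AB", 6), ("AD", 5), ("BC", 6), ("BD", 7), ("CD", 5),
                       ("ABD", 5), ("BCD", 5)]:
    tree.insert(itemset, count)

print(tree.support_count("ABD"))
print(tree.level(2))
```

Sharing prefixes means that the supports needed to compute a rule's confidence (for example ABD and AB) sit on the same root-to-leaf path.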

Page 12: Fundamentos de Minería de Datos

Discovery

An alternative to Apriori: compress the database, representing frequent items, into a frequent-pattern tree (FP-tree)…

Han, Pei & Yin: "Mining Frequent Patterns without Candidate Generation", SIGMOD'2000
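A sketch of the compression step only, under simplifying assumptions: infrequent items are dropped, the remaining items of each transaction are sorted by decreasing frequency, and transactions are inserted into a prefix tree so that shared prefixes are stored once. The header table, node links and the pattern-growth mining phase of the published algorithm are omitted, and all names and data are hypothetical.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    """Return (root, item_counts) for the frequent items of the database."""
    freq = Counter(item for t in transactions for item in set(t))
    freq = {item: c for item, c in freq.items() if c >= min_count}

    root = FPNode(None, None)
    for t in transactions:
        # Keep frequent items, most frequent first (ties broken by name).
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, freq


def count_nodes(node):
    return sum(1 + count_nodes(child) for child in node.children.values())


# Hypothetical database.
db = [set("ABD"), set("ABD"), set("AB"), set("BC"), set("CD")]
root, freq = build_fp_tree(db, min_count=2)
print(count_nodes(root), "tree nodes for",
      sum(len(t) for t in db), "item occurrences")
```

The first three transactions share the prefix B, A, which is where the compression comes from in this small example.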

Page 13: Fundamentos de Minería de Datos

Discovery

A challenge: when an itemset is frequent, all its subsets are also frequent

Closed itemset C: there exists no proper super-itemset S such that support(S) = support(C)

Maximal (frequent) itemset M: M is frequent and there exists no super-itemset Y such that M ⊂ Y and Y is frequent.
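Both definitions can be applied directly to a table of frequent itemsets. This is a quadratic sketch for illustration, with hypothetical counts; it relies on the fact that a superset with the same support as a frequent itemset is itself frequent, so only frequent supersets need to be inspected.

```python
def closed_and_maximal(frequent):
    """frequent: dict frozenset -> support count of ALL frequent itemsets."""
    closed, maximal = set(), set()
    for itemset, supp in frequent.items():
        supersets = [other for other in frequent if itemset < other]
        if not supersets:
            maximal.add(itemset)   # no frequent proper superset
        if all(frequent[other] < supp for other in supersets):
            closed.add(itemset)    # no proper superset with equal support
    return closed, maximal


# Hypothetical frequent itemsets with their support counts.
frequent = {
    frozenset("A"): 3, frozenset("B"): 4, frozenset("C"): 2, frozenset("D"): 3,
    frozenset("AB"): 3, frozenset("AD"): 2, frozenset("BD"): 2,
    frozenset("ABD"): 2,
}

closed, maximal = closed_and_maximal(frequent)
print("closed: ", sorted("".join(sorted(s)) for s in closed))
print("maximal:", sorted("".join(sorted(s)) for s in maximal))
```

Eight frequent itemsets shrink to five closed ones (from which all supports can still be recovered) and two maximal ones (which only preserve which itemsets are frequent).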

Page 14: Fundamentos de Minería de Datos

Variations

Based on the kinds of patterns to be mined:

Frequent itemset mining (transactional and relational data)

Sequential pattern mining (sequence data sets, e.g. bioinformatics)

Structured pattern mining (structured data, e.g. graphs)

Page 15: Fundamentos de Minería de Datos

Variations

Based on the types of values handled:

Boolean association rules

Quantitative association rules

Fuzzy association rules

Delgado, Marín, Sánchez & Vila: "Fuzzy association rules: General model and applications", IEEE Transactions on Fuzzy Systems, 2003

Page 16: Fundamentos de Minería de Datos

Variations

More options:

Generalized association rules (a.k.a. multilevel association rules)

Constraint-based association rule mining

Incremental algorithms

Top-k algorithms

ICDM FIMI: Workshop on Frequent Itemset Mining Implementations, http://fimi.cs.helsinki.fi/

Page 17: Fundamentos de Minería de Datos

Visualization

Integrated into data mining tools to help users understand data mining results:

Table-based approach, e.g. SAS Enterprise Miner, DBMiner…

2D matrix-based approach, e.g. SGI MineSet, DBMiner…

Graph-based techniques, e.g. DBMiner ball graphs

Page 18: Fundamentos de Minería de Datos

Visualization: Tables

Page 19: Fundamentos de Minería de Datos

Visualization: Visual aids

Page 20: Fundamentos de Minería de Datos

Visualization: 2D Matrix

Page 21: Fundamentos de Minería de Datos

Visualization: Graphs

Page 22: Fundamentos de Minería de Datos

Visualization: VisAR

Based on parallel coordinates (Techapichetvanich & Datta, ADMA'2005)

Page 23: Fundamentos de Minería de Datos

Extensions

Confidence is not the best possible interestingness measure for rules

e.g. A very frequent item will tend to appear in rule consequents, regardless of its true relationship with the rule antecedent

X went to war ⇒ X did not serve in Vietnam (from the US Census)
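A numeric illustration of the problem, with made-up supports (they are not the Census figures): when the consequent is very frequent, a rule can reach high confidence even though the antecedent actually makes the consequent slightly less likely. Lift, i.e. confidence divided by the consequent's support, is used here only as one simple way to expose this.

```python
def confidence(supp_xy, supp_x):
    """P(Y|X) from relative supports."""
    return supp_xy / supp_x


def lift(supp_xy, supp_x, supp_y):
    """Ratio between the observed joint support and the one expected under
    independence; values below 1 indicate a negative association."""
    return supp_xy / (supp_x * supp_y)


# Hypothetical supports: Y is present in 90% of the records.
supp_x, supp_y, supp_xy = 0.20, 0.90, 0.17

conf = confidence(supp_xy, supp_x)
ratio = lift(supp_xy, supp_x, supp_y)
print(f"confidence = {conf:.2f}, lift = {ratio:.3f}")
```

The rule would pass a typical confidence threshold, yet its confidence is lower than the 0.90 one would get by predicting Y unconditionally.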

Page 24: Fundamentos de Minería de Datos

Extensions

Desirable properties for interestingness measures (Piatetsky-Shapiro, 1991)

P1 ACC(A⇒C) = 0 when supp(A⇒C) = supp(A)supp(C)

P2 ACC(A⇒C) monotonically increases with supp(A⇒C)

P3 ACC(A⇒C) monotonically decreases with supp(A) (or supp(C))
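The three properties can be checked numerically on a concrete measure. The sketch below uses the difference supp(A⇒C) − supp(A)supp(C) as an example measure that satisfies them; the function name and the support values are illustrative assumptions, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction as F


def leverage(supp_ac, supp_a, supp_c):
    """Example measure: observed joint support minus the support expected
    if A and C were independent."""
    return supp_ac - supp_a * supp_c


# P1: zero under independence (0.12 = 0.3 * 0.4).
p1 = leverage(F(12, 100), F(3, 10), F(4, 10))

# P2: grows with supp(A=>C) when supp(A) and supp(C) are fixed.
p2 = [leverage(F(s, 100), F(3, 10), F(4, 10)) for s in (10, 12, 20)]

# P3: shrinks as supp(A) grows when the other two values are fixed.
p3 = [leverage(F(12, 100), F(a, 10), F(4, 10)) for a in (2, 3, 5)]

print(p1, p2, p3)
```

Confidence alone fails P1: under independence it equals supp(C), not 0, which is the formal counterpart of the frequent-consequent problem.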

Page 25: Fundamentos de Minería de Datos

Extensions

Certainty factors…

… satisfy Piatetsky-Shapiro's properties

… are widely used in expert systems

… are not symmetric (unlike interest/lift)

… can substitute conviction when CF>0

Berzal, Blanco, Sánchez & Vila: "Measuring the accuracy and interest of association rules: A new framework", Intelligent Data Analysis, 2002
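A sketch of how a certainty factor can be computed from confidence and consequent support, assuming the usual piecewise definition: the gain over supp(C) is normalized by 1 − supp(C) when positive and by supp(C) when negative. The conviction formula (1 − supp(C)) / (1 − conf) is likewise an assumption of this sketch. Under these definitions CF = 1 − 1/conviction whenever CF > 0, which is one way to read the last bullet.

```python
from fractions import Fraction as F


def certainty_factor(conf, supp_c):
    """CF in [-1, 1]; 0 means the antecedent tells nothing about C."""
    if conf > supp_c:
        return (conf - supp_c) / (1 - supp_c)
    if conf < supp_c:
        return (conf - supp_c) / supp_c
    return conf - supp_c  # exactly zero


def conviction(conf, supp_c):
    """Undefined (division by zero) for rules with confidence 1."""
    return (1 - supp_c) / (1 - conf)


# Hypothetical rule with a positive association.
cf_pos = certainty_factor(F(3, 4), F(1, 2))
conv = conviction(F(3, 4), F(1, 2))

# Hypothetical rule with high confidence but a very frequent consequent.
cf_neg = certainty_factor(F(17, 20), F(9, 10))

print(cf_pos, conv, 1 - 1 / conv, cf_neg)
```

The second rule has confidence 0.85 but a negative certainty factor, so the measure flags it as uninteresting where confidence does not.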

Page 26: Fundamentos de Minería de Datos

Extensions

References:

Hilderman & Hamilton: "Evaluation of interestingness measures for ranking discovered knowledge". PAKDD, 2001

Tan, Kumar & Srivastava: "Selecting the right objective measure for association analysis". Information Systems, vol. 29, pp. 293-313, 2004.

Berzal, Cubero, Marín, Sánchez, Serrano & Vila: "Association rule evaluation for classification purposes". TAMIDA'2005

Page 27: Fundamentos de Minería de Datos

Applications

Two sample applications where association rules have been successful

Classification (ART)

Berzal, Cubero, Sánchez & Serrano: "ART: A hybrid classification model", Machine Learning Journal, 2004

Anomaly detection (ATBAR)

Balderas, Berzal, Cubero, Eisman & Marín: "Discovering Hidden Association Rules", KDD'2005, Chicago, Illinois, USA

Page 28: Fundamentos de Minería de Datos

Classification

Classification models based on association rules

Partial classification models, e.g. Bayardo

"Associative" classification models, e.g. CBA (Liu et al.)

Bayesian classifiers, e.g. LB (Meretakis et al.)

Emerging patterns, e.g. CAEP (Dong et al.)

Rule trees, e.g. Wang et al.

Rules with exceptions, e.g. Liu et al.

Page 29: Fundamentos de Minería de Datos

Classification

GOAL

Simple, intelligible, and robust classification models obtained in an efficient and scalable way

MEANS

Decision Tree Induction + Association Rule Mining = ART [Association Rule Trees]

Page 30: Fundamentos de Minería de Datos

ART Classification Model

IDEA

Make use of efficient association rule mining algorithms to build a decision-tree-shaped classification model.

ART = Association Rule Tree

KEY

Association rules + "else" branches

Hybrid between decision trees and decision lists
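To make the "rules + else branch" shape concrete, here is a hypothetical sketch of how such a model could be represented and used to classify an example. It only covers prediction with an already built model; the class, the attribute names and the tiny model are invented for illustration and do not reproduce the induction algorithm described in the paper.

```python
class ARTNode:
    """Either a leaf (label set) or an internal node holding a list of
    (antecedent, child) rules plus a mandatory "else" child."""

    def __init__(self, label=None, rules=None, else_branch=None):
        self.label = label
        self.rules = rules or []
        self.else_branch = else_branch


def classify(node, example):
    while node.label is None:
        for antecedent, child in node.rules:
            # A rule fires when every (attribute, value) pair matches.
            if all(example.get(attr) == value
                   for attr, value in antecedent.items()):
                node = child
                break
        else:
            # No rule fired: follow the "else" branch, as in a decision list.
            node = node.else_branch
    return node.label


# Hypothetical model over binary attributes X, Y and Z.
model = ARTNode(
    rules=[({"X": 0, "Y": 0}, ARTNode(label=0)),
           ({"X": 1, "Y": 1}, ARTNode(label=1))],
    else_branch=ARTNode(
        rules=[({"Z": 0}, ARTNode(label=0)),
               ({"Z": 1}, ARTNode(label=1))],
        else_branch=ARTNode(label=0),
    ),
)

print(classify(model, {"X": 1, "Y": 1, "Z": 0}))
print(classify(model, {"X": 0, "Y": 1, "Z": 1}))
```

Each level can test several attributes at once (like a rule antecedent), while everything not covered is handed to a single "else" subtree instead of being split into one branch per value.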

Page 31: Fundamentos de Minería de Datos

ART Classification Model

SPLICE

Page 32: Fundamentos de Minería de Datos

ART classification model

Example: ART vs. TDIDT

[Figure: ART and TDIDT trees built over the binary attributes X, Y and Z; the ART tree includes an "else" branch]

Page 33: Fundamentos de Minería de Datos

ART classification model: Final comments

Classification models

Acceptable accuracy

Reduced complexity

Attribute interactions

Robustness (noise & primary keys)

Classifier building method

Efficient algorithm

Good scalability properties

Automatic parameter selection

Page 34: Fundamentos de Minería de Datos

Anomaly detection

It is often more interesting to find surprising non-frequent events than frequent ones

EXAMPLES

Abnormal network activity patterns in intrusion detection systems.

Exceptions to "common" rules in Medicine (useful for diagnosis, drug evaluation, detection of conflicting therapies…)

…

Page 35: Fundamentos de Minería de Datos

Anomaly detection

Anomalous association rule

Confident rule representing homogeneous deviations from common behavior.

Page 36: Fundamentos de Minería de Datos

Anomaly detection

Anomalous association rule

X ⇒ Y is frequent and confident

X ¬Y ⇒ A is confident

X Y ⇒ ¬A is confident

X usually implies Y (dominant rule)

When X does not imply Y, then it usually implies A (the anomaly)
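The three conditions can be tested directly on a set of transactions. This is a brute-force sketch under stated assumptions: X, Y and A are given as itemsets, ¬Y means "Y is not fully contained in the transaction", and the function names, thresholds and data are hypothetical. ATBAR itself obtains the necessary counts from its itemset tree rather than by rescanning the data for each rule.

```python
def confidence(rows, condition, outcome):
    """Fraction of the rows satisfying `condition` that also satisfy `outcome`."""
    covered = [r for r in rows if condition(r)]
    if not covered:
        return 0.0
    return sum(1 for r in covered if outcome(r)) / len(covered)


def is_anomalous(rows, x, y, a, min_supp, min_conf):
    """Check whether 'X => A when not Y' holds as an anomalous rule."""
    with_xy = lambda r: x <= r and y <= r
    with_x_not_y = lambda r: x <= r and not (y <= r)

    # 1. X => Y is frequent and confident (dominant rule).
    supp_xy = sum(1 for r in rows if with_xy(r)) / len(rows)
    dominant = (supp_xy >= min_supp and
                confidence(rows, lambda r: x <= r, lambda r: y <= r) >= min_conf)
    # 2. X not-Y => A is confident.
    anomaly = confidence(rows, with_x_not_y, lambda r: a <= r) >= min_conf
    # 3. X Y => not-A is confident.
    exclusive = confidence(rows, with_xy, lambda r: not (a <= r)) >= min_conf
    return dominant and anomaly and exclusive


# Hypothetical data: X usually comes with Y; otherwise it comes with A.
rows = [frozenset({"X", "Y"})] * 8 + [frozenset({"X", "A"})] * 2
X, Y, A = frozenset({"X"}), frozenset({"Y"}), frozenset({"A"})

print(is_anomalous(rows, X, Y, A, min_supp=0.5, min_conf=0.75))
```

Note that the anomalous rule itself is supported by only two rows here, so it would never be found by looking for frequent rules alone.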

Page 37: Fundamentos de Minería de Datos

Anomaly detection

X Y A1 Z1 …

X Y A1 Z2 …

X Y A2 Z3 …

X Y A2 Z1 …

X Y A3 Z2 …

X Y A3 Z3 …

X Y A Z …

X Y3 A Z3

X Y3 A Z …

X Y4 A Z …

X ⇒ Y is the dominant rule

X ⇒ A when ¬Y is the anomalous rule

Page 38: Fundamentos de Minería de Datos

Anomaly detection

Suzuki et al.'s "Exception Rules"

X ⇒ Y is an association rule

X I ⇒ ¬Y is the exception rule

X ⇒ I is the reference rule

I is the "interacting" itemset

Too many exceptions

The "cause" needs to be present

Page 39: Fundamentos de Minería de Datos

Anomaly detection: ATBAR

Anomalous association rules

[Figure: ATBAR itemset tree]

First scan: A #7, B #9, C #7, D #8

Second scan, below A #7: B #6, D #5, A*

Counts for A #7: AB #6, AC #4, AD #5, AE #3, AF #3 (callout: "Non-frequent")

Page 40: Fundamentos de Minería de Datos

Anomaly detection: ATBAR

Anomalous association rules

[Figure: ATBAR itemset tree]

First scan: A #7, B #9, C #7, D #8

Second scan: below A #7: B #6, D #5, A*; below B #9: C #6, D #7, B*; below C #7: D #5, C*; below D #8: D*

Page 41: Fundamentos de Minería de Datos

Anomaly detection: ATBAR

Anomalous association rules

Rule generation is immediate from the frequent and extended itemsets obtained by ATBAR

Page 42: Fundamentos de Minería de Datos

Anomaly detection: Results

Experiments on health-related datasets from the UCI Machine Learning Repository

Relatively small set of anomalous rules (typically, >90% reduction with respect to standard association rules)

Reasonable overhead needed to obtain anomalous association rules (about 20% in ATBAR w.r.t. TBAR)

Page 43: Fundamentos de Minería de Datos

Anomaly detection: Results

An example from the Census dataset:

if WORKCLASS: Local-gov

then CAPGAIN: [99999.0 , 99999.0] (7 out of 7) ("Anomaly")

when not CAPGAIN: [0.0 , 20051.0] (usual consequent)

Page 44: Fundamentos de Minería de Datos

Anomaly detection: Results

Anomalous association rules (novel characterization of potentially interesting knowledge)

An efficient algorithm for discovering anomalous association rules: ATBAR

Some heuristics for filtering the discovered anomalous association rules