Knowledge Discovery
Transparencies prepared by Ho Tu Bao [JAIST], ITCS 6162


Page 1:

Knowledge Discovery

Transparencies prepared by Ho Tu Bao [JAIST], ITCS 6162

Page 2:

“Are there clusters of similar cells?”

Light color with 1 nucleus

Dark color with 2 tails 2 nuclei

1 nucleus and 1 tail

Dark color with 1 tail and 2 nuclei

Clustering

Page 3:

Task: Discovering association rules among items in a transaction database.

An association between two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B.

In general: A1, A2, … => B

Association Rule Discovery

Page 4:

“Are there any associations between the characteristics of the cells?”

If Color = light and #nuclei = 1 then #tails = 1 (support = 12.5%; confidence = 50%)

If #nuclei = 2 and Cell = Cancerous then #tails = 2 (support = 25%; confidence = 100%)

If #tails = 1 then Color = light (support = 37.5%; confidence = 75%)

Association Rule Discovery

Page 5:

Many Other Data Mining Techniques

Genetic Algorithms
Statistics, Bayesian Networks
Rough Sets, Time Series
Text Mining

Page 6:

Lecture 1: Overview of KDD

1. What is KDD and Why?

2. The KDD Process

3. KDD Applications

4. Data Mining Methods

5. Challenges for KDD

Page 7:

Challenges and Influential Aspects

Aspects surrounding the knowledge discovery process:

• Handling of different types of data with different degrees of supervision

• Changing data and knowledge

• Understandability of patterns; various kinds of requests and results (decision lists, inference networks, concept hierarchies, etc.)

• Interactivity and visualization

• Different sources of data (distributed, heterogeneous databases; noisy, missing, and irrelevant data, etc.)

• Massive data sets, high dimensionality (efficiency, scalability)

Page 8:

High dimensionality exponentially increases the size of the space of patterns.

Massive Data Sets and High Dimensionality

With p attributes, each taking d discrete values on average, the space of possible instances has size d^p.

Classes: {Cancerous, Healthy}
Attributes: Cell body: {dark, light}; #nuclei: {1, 2}; #tails: {1, 2}

[Figure: healthy cells h1, h2 and cancerous cells C1, C2, C3]

(# instances = 2^3 = 8)

38 attributes, each with 10 values: # instances = 10^38

# attributes ?
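The counting argument above can be checked directly; a minimal sketch (the function name is for illustration only):

```python
def instance_space_size(domain_sizes):
    """Number of distinct instances given the domain size of each attribute."""
    size = 1
    for d in domain_sizes:
        size *= d
    return size

# Cell example: body {dark, light}, #nuclei {1, 2}, #tails {1, 2}
print(instance_space_size([2, 2, 2]))   # 8 possible cells

# 38 attributes with 10 values each: 10^38 possible instances
print(instance_space_size([10] * 38))
```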

Page 9:

Different Types of Data

Attribute type          Structure           Symbolic examples     Numerical examples
Nominal (categorical)   No structure        Places, Color         -
Ordinal                 Ordinal structure   Rank, Resemblance     Age, Temperature, Taste
Measurable              Ring structure      -                     Income, Length

Symbolic data often call for combinatorial search in hypothesis spaces (machine learning); numerical data often allow matrix-based computation (multivariate data analysis).

Page 10:

Brief introduction to lectures

Lecture 1: Overview of KDD

Lecture 2: Preparing data

Lecture 3: Decision tree induction

Lecture 4: Mining association rules

Lecture 5: Automatic cluster detection

Lecture 6: Artificial neural networks

Lecture 7: Evaluation of discovered knowledge

Page 11:

Lecture 2: Preparing Data

• As much as 80% of KDD is about preparing data, and the remaining 20% is about mining

• Content of the lecture

1. Data cleaning
2. Data transformations
3. Data reduction
4. Software and case studies

• Prerequisite: Nothing special, but some understanding of statistics is expected

Page 12:

Data Preparation

The design and organization of data, including the setting of goals and the composition of features, is done by humans. There are two central goals for the preparation of data:

• To organize data into a standard form that is ready for processing by data mining programs.

• To prepare features that lead to the best data mining performance.
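A minimal sketch of the two goals on hypothetical records (the field names, default value, and encoding are assumptions, not from the slides): fill a missing value and encode a categorical attribute into fixed-order numeric vectors ready for a mining program.

```python
# Hypothetical raw records in the style of the cell example
raw = [
    {"color": "light", "nuclei": "1", "tails": "1"},
    {"color": "dark",  "nuclei": "2", "tails": None},   # missing value
]

def prepare(records, default_tails="1"):
    """Clean and encode records into fixed-order numeric feature vectors."""
    rows = []
    for r in records:
        # data cleaning: fill the missing value with an assumed default
        tails = r["tails"] if r["tails"] is not None else default_tails
        rows.append([
            0 if r["color"] == "light" else 1,  # encode the categorical attribute
            int(r["nuclei"]),
            int(tails),
        ])
    return rows

print(prepare(raw))  # [[0, 1, 1], [1, 2, 1]]
```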

Page 13:

Brief introduction to lectures

Lecture 1: Overview of KDD

Lecture 2: Preparing data

Lecture 3: Decision tree induction

Lecture 4: Mining association rules

Lecture 5: Automatic cluster detection

Lecture 6: Artificial neural networks

Lecture 7: Evaluation of discovered knowledge

Page 14:

Lecture 3: Decision Tree Induction

•One of the most widely used KDD classification techniques for supervised data.

•Content of the lecture

1. Decision tree representation and framework
2. Attribute selection
3. Pruning trees
4. Software C4.5, CABRO and case studies

•Prerequisite: Nothing special

Page 15:

Decision Trees

A decision tree is a classifier in the form of a tree structure, where each node is either:

• a leaf node, indicating a class of instances, or
• a decision node that specifies some test to be carried out on a single attribute value, with one branch and subtree for each possible outcome of the test

A decision tree can be used to classify an instance by starting at the root of the tree and moving through it until a leaf is met.

Page 16:

General Framework of Decision Tree Induction

1. Choose the "best" attribute by a given selection measure
2. Extend tree by adding a new branch for each attribute value
3. Sort training examples to leaf nodes
4. If examples unambiguously classified Then Stop Else Repeat steps 1-4 for leaf nodes

     Headache   Temperature   Flu
e1   yes        normal        no
e2   yes        high          yes
e3   yes        very high     yes
e4   no         normal        no
e5   no         high          no
e6   no         very high     no

[Decision tree:
 Temperature = normal    -> no {e1, e4}
 Temperature = high      -> Headache {e2, e5}: yes -> yes {e2}; no -> no {e5}
 Temperature = very high -> Headache {e3, e6}: yes -> yes {e3}; no -> no {e6}]

5. Pruning unstable leaf nodes
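Steps 1-4 can be sketched as a small induction routine on the Headache/Temperature table. The slides leave the selection measure open; plain information gain (least remaining entropy) is assumed here, and with that measure the root chosen is Headache rather than the Temperature split drawn on the slide; both trees classify the six examples correctly.

```python
from collections import Counter
from math import log2

examples = [  # the Headache/Temperature/Flu table above
    ({"Headache": "yes", "Temperature": "normal"},    "no"),
    ({"Headache": "yes", "Temperature": "high"},      "yes"),
    ({"Headache": "yes", "Temperature": "very high"}, "yes"),
    ({"Headache": "no",  "Temperature": "normal"},    "no"),
    ({"Headache": "no",  "Temperature": "high"},      "no"),
    ({"Headache": "no",  "Temperature": "very high"}, "no"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_attribute(exs, attrs):
    """Step 1: pick the attribute whose split leaves the least entropy."""
    def remainder(a):
        values = {x[a] for x, _ in exs}
        return sum(entropy([y for x, y in exs if x[a] == v]) *
                   sum(1 for x, _ in exs if x[a] == v) / len(exs)
                   for v in values)
    return min(attrs, key=remainder)

def build(exs, attrs):
    labels = [y for _, y in exs]
    if len(set(labels)) == 1 or not attrs:   # step 4: unambiguous -> leaf
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(exs, attrs)           # step 1
    rest = [b for b in attrs if b != a]
    return (a, {v: build([(x, y) for x, y in exs if x[a] == v], rest)  # steps 2-3
                for v in {x[a] for x, _ in exs}})

def classify(tree, instance):
    while isinstance(tree, tuple):           # walk from the root to a leaf
        attr, branches = tree
        tree = branches[instance[attr]]
    return tree

tree = build(examples, ["Headache", "Temperature"])
print(all(classify(tree, x) == y for x, y in examples))  # True
```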

Page 17:

Some Attribute Selection Measures

Gain-ratio (Quinlan, C4.5, 1993): Gain(A) / SplitInfo(A), both built from entropy terms of the form -Σ_i p_i log2 p_i

Gini-index (Breiman, CART, 1984): 1 - Σ_i p_i^2

Statistics (χ²): Σ_ij (n_ij - e_ij)^2 / e_ij, with expected counts e_ij = n_i. × n_.j / n

R-measure (Nguyen & Ho, CABRO, 1996): built from max_i {p_ij} terms; an attribute-dependency measure from rough set theory
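The entropy and Gini building blocks of these measures are easy to compute; a minimal sketch on the class distribution of the six Flu examples from the previous slide (2 yes, 4 no):

```python
from math import log2

def entropy(probs):
    """Shannon entropy, the building block of C4.5's gain ratio."""
    return -sum(p * log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini index, the impurity measure used by CART."""
    return 1 - sum(p * p for p in probs)

dist = [2/6, 4/6]                # 2 "yes", 4 "no"
print(round(entropy(dist), 3))   # 0.918
print(round(gini(dist), 3))      # 0.444
```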

Page 18:

Avoiding Overfitting

How can we avoid overfitting?

• Stop growing when data split not statistically significant (pre-pruning)

• Grow full tree, then post-prune (post-pruning)

Two post-pruning techniques

• Reduced-Error Pruning

• Rule Post-Pruning

Page 19:

Converting A Tree to Rules

[Decision tree:
 Outlook = sunny  -> Humidity: high -> no; normal -> yes
 Outlook = o'cast -> yes
 Outlook = rain   -> Wind: true -> no; false -> yes]

IF (Outlook = Sunny) and (Humidity = High) THEN PlayTennis = No

IF (Outlook = Sunny) and (Humidity = Normal) THEN PlayTennis = Yes

...
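The conversion is a walk over every root-to-leaf path; a minimal sketch, where the nested (attribute, branches) encoding of the PlayTennis tree is an assumption for illustration, not part of the slides:

```python
def tree_to_rules(tree, conditions=()):
    """Enumerate root-to-leaf paths as IF ... THEN ... rules."""
    if not isinstance(tree, tuple):               # leaf reached: emit one rule
        ifpart = " and ".join(f"({a} = {v})" for a, v in conditions)
        return [f"IF {ifpart} THEN PlayTennis = {tree}"]
    attr, branches = tree
    rules = []
    for value, subtree in branches.items():       # one path per branch
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

# The PlayTennis tree from the slide as nested (attribute, branches) pairs
tennis = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"True": "No", "False": "Yes"}),
})

for rule in tree_to_rules(tennis):
    print(rule)
# first rule printed:
# IF (Outlook = Sunny) and (Humidity = High) THEN PlayTennis = No
```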

Page 20:

CABRO: Decision Tree Induction

CABRO is based on the R-measure, a measure of attribute dependency stemming from rough set theory.

[Figure: an unknown case is matched against the discovered decision tree; the matching path gives the class of the unknown instance]

Page 21:

Brief introduction to lectures

Lecture 1: Overview of KDD

Lecture 2: Preparing data

Lecture 3: Decision tree induction

Lecture 4: Mining association rules

Lecture 5: Automatic cluster detection

Lecture 6: Artificial neural networks

Lecture 7: Evaluation of discovered knowledge

Page 22:

Lecture 4: Mining Association Rules

• A new technique and attractive topic: it allows discovering the important associations among items of transactions in a database.

• Content of the lecture

1. Basic Definitions
2. Algorithm Apriori
3. The Basket Analysis Program

• Prerequisite: Nothing special

Page 23:

There are two measures of rule strength:

Support of (A => B) = [AB] / N, where [AB] is the number of records containing both A and B, and N is the number of records in the database.

The support of a rule is the proportion of times the rule applies.

Confidence of (A => B) = [AB] / [A]

The confidence of a rule is the proportion of times the rule is correct.

Measures of Association Rules
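Both measures can be computed directly from the counts in their definitions; a minimal sketch, using the four-transaction database (TID 1000-4000) from the Apriori slide that follows:

```python
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def count(itemset, db):
    """[X]: number of records containing every item of the itemset."""
    return sum(1 for t in db if itemset <= t)

def support(antecedent, consequent, db):
    """Support of A => B: [AB] / N, the proportion of times the rule applies."""
    return count(antecedent | consequent, db) / len(db)

def confidence(antecedent, consequent, db):
    """Confidence of A => B: [AB] / [A], the proportion of times it is correct."""
    return count(antecedent | consequent, db) / count(antecedent, db)

# A => C: A and C occur together in 2 of 4 records; A occurs in 3
print(support({"A"}, {"C"}, transactions))     # 0.5
print(confidence({"A"}, {"C"}, transactions))  # 0.6666666666666666 (= 2/3)
```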

Page 24:

• The task of mining association rules is mainly to discover strong association rules (high confidence and high support) in large databases.

Algorithm Apriori

TID    Items
1000   A, B, C
2000   A, C
3000   A, D
4000   B, E, F

With minimum support s = 40%, the large-support itemsets are:
{A} 75%, {B} 50%, {C} 50%, {A, C} 50%

• Mining association rules is composed of two steps:

1. Discover the large itemsets, i.e., the itemsets whose transaction support is above a predetermined minimum support s.

2. Use the large itemsets to generate the association rules.

Page 25:

Algorithm Apriori: Illustration (minimum support S = 40%, i.e., 2 of 4 transactions)

Database D
TID  Items
100  A, C, D
200  B, C, E
300  A, B, C, E
400  B, E

Scan D -> C1 (candidate 1-itemsets with counts):
{A} 2, {B} 3, {C} 3, {D} 1, {E} 3

L1 (large 1-itemsets):
{A} 2, {B} 3, {C} 3, {E} 3

C2 (candidates): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
Scan D -> counts: {A,B} 1, {A,C} 2, {A,E} 1, {B,C} 2, {B,E} 3, {C,E} 2

L2: {A,C} 2, {B,C} 2, {B,E} 3, {C,E} 2

C3 (candidates): {B,C,E}
Scan D -> count: {B,C,E} 2

L3: {B,C,E} 2
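The illustration can be reproduced with a short script. This is a simplified sketch, not the full algorithm: candidates are generated as all size-k unions of large (k-1)-itemsets instead of Apriori's join-and-prune step, which gives the same result on this small database.

```python
from itertools import combinations

D = {100: {"A", "C", "D"}, 200: {"B", "C", "E"},
     300: {"A", "B", "C", "E"}, 400: {"B", "E"}}
minsup = 2  # 40% of 4 transactions

def apriori(db, minsup):
    """Return all large itemsets (as frozensets) with their support counts."""
    items = sorted({i for t in db.values() for i in t})
    # L1: large 1-itemsets
    large = {frozenset([i]): c for i in items
             if (c := sum(1 for t in db.values() if i in t)) >= minsup}
    all_large, k = dict(large), 2
    while large:
        # simplified candidate generation: size-k unions of large (k-1)-itemsets
        candidates = {a | b for a, b in combinations(large, 2) if len(a | b) == k}
        # scan the database and keep candidates meeting minimum support
        large = {c: n for c in candidates
                 if (n := sum(1 for t in db.values() if c <= t)) >= minsup}
        all_large.update(large)
        k += 1
    return all_large

L = apriori(D, minsup)
print(L[frozenset("BCE")])  # 2, matching L3 on the slide
```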