1
Knowledge Discovery
Transparencies prepared by Ho Tu Bao (JAIST), ITCS 6162
2
Clustering
“Are there clusters of similar cells?”
• Light color with 1 nucleus
• Dark color with 2 tails and 2 nuclei
• 1 nucleus and 1 tail
• Dark color with 1 tail and 2 nuclei
3
Association Rule Discovery
Task: discovering association rules among items in a transaction database.
An association between two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B.
In general: A1, A2, … => B
4
Association Rule Discovery
“Are there any associations between the characteristics of the cells?”
• If Color = light and #nuclei = 1 then #tails = 1 (support = 12.5%; confidence = 50%)
• If #nuclei = 2 and Cell = Cancerous then #tails = 2 (support = 25%; confidence = 100%)
• If #tails = 1 then Color = light (support = 37.5%; confidence = 75%)
5
Many Other Data Mining Techniques
• Genetic Algorithms
• Statistics
• Bayesian Networks
• Rough Sets
• Time Series
• Text Mining
6
Lecture 1: Overview of KDD
1. What is KDD and Why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
7
Challenges and Influential Aspects
• Massive data sets, high dimensionality (efficiency, scalability)
• Different sources of data (distributed, heterogeneous databases; noisy, missing, and irrelevant data, etc.)
• Handling of different types of data with different degrees of supervision
• Changing data and knowledge
• Understandability of patterns; various kinds of requests and results (decision lists, inference networks, concept hierarchies, etc.)
• Interaction and visualization
[Figure: these aspects arranged around “Knowledge Discovery” at the center.]
8
Massive Data Sets and High Dimensionality
High dimensionality increases the size of the space of patterns exponentially: with p attributes, each having on average d discrete values, the space of possible instances has size $d^p$.
Example: Classes: {Cancerous, Healthy}; Attributes: Cell body: {dark, light}, #nuclei: {1, 2}, #tails: {1, 2} (# instances = 2^3 = 8).
[Figure: the 8-instance space, with cancerous cells C1, C2, C3 and healthy cells h1, h2 marked.]
With 38 attributes, each with 10 values on average: # instances = 10^38.
# attributes?
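To make the growth concrete, a minimal sketch in plain Python (the function name is ours, purely illustrative):

```python
# Size of the instance space for p attributes with about d values each: d**p.
def instance_space_size(d: int, p: int) -> int:
    return d ** p

# The cell example: 3 binary attributes -> 2**3 = 8 possible instances.
print(instance_space_size(2, 3))     # 8

# 38 attributes with 10 values each -> 10**38 instances: no database
# can come close to covering this space.
print(instance_space_size(10, 38))   # 10**38
```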
9
Different Types of Data
• Nominal (categorical), no structure: Places, Color
• Ordinal, ordinal structure: Rank, Resemblance
• Measurable (numerical), ring structure: Age, Temperature, Taste, Income, Length
Symbolic attributes call for combinatorial search in hypothesis spaces (machine learning); numerical attributes often allow matrix-based computation (multivariate data analysis).
10
Brief introduction to lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
11
Lecture 2: Preparing Data
•As much as 80% of KDD is about preparing data; the remaining 20% is about mining.
•Content of the lecture (a small sketch follows):
1. Data cleaning
2. Data transformations
3. Data reduction
4. Software and case studies
•Prerequisite: nothing special, but some understanding of statistics is expected.
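As a small illustration of steps 1-3, a hedged sketch using pandas; the column names, thresholds, and cleaning rules are invented for the example, not taken from the lecture:

```python
import pandas as pd

# Toy records with typical problems: a missing value and an entry error.
df = pd.DataFrame({
    "age":    [25, None, 47, 190],          # 190 is out of range
    "income": [30000, 45000, None, 52000],
})

# 1. Data cleaning: treat impossible values as missing, then impute.
df.loc[df["age"] > 120, "age"] = None
df = df.fillna(df.median(numeric_only=True))

# 2. Data transformation: min-max scaling of each column to [0, 1].
scaled = (df - df.min()) / (df.max() - df.min())

# 3. Data reduction: keep only columns with enough variance (toy criterion).
reduced = scaled.loc[:, scaled.std() > 0.1]
print(reduced)
```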
12
Data Preparation
The design and organization of data, including the setting of goals and the composition of features, are done by humans. There are two central goals for the preparation of data:
• To organize data into a standard form that is ready for processing by data mining programs.
• To prepare features that lead to the best data mining performance.
13
Brief introduction to lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
14
Lecture 3: Decision Tree Induction
•One of the most widely used KDD classification techniques for supervised data.
•Content of the lecture
1. Decision tree representation and framework
2. Attribute selection
3. Pruning trees
4. Software C4.5, CABRO and case studies
•Prerequisite: Nothing special
15
Decision Trees
A decision tree is a classifier in the form of a tree structure, where each node is either:
• a leaf node, indicating a class of instances, or
• a decision node, specifying some test to be carried out on a single attribute value, with one branch and subtree for each possible outcome of the test.
A decision tree can be used to classify an instance by starting at the root of the tree and moving through it until a leaf is met.
16
General Framework of Decision Tree Induction
1. Choose the “best” attribute by a given selection measure
2. Extend the tree by adding a new branch for each attribute value
3. Sort the training examples to the leaf nodes
4. If examples are unambiguously classified, then stop; else repeat steps 1-4 for the leaf nodes
(A minimal code sketch follows the example below.)
    Headache  Temperature  Flu
e1  yes       normal       no
e2  yes       high         yes
e3  yes       very high    yes
e4  no        normal       no
e5  no        high         no
e6  no        very high    no

[Figure: a decision tree for this table]
Temperature?
• normal: no {e1, e4}
• high: Headache? {e2, e5} (yes: yes {e2}; no: no {e5})
• very high: Headache? {e3, e6} (yes: yes {e3}; no: no {e6})
5. Prune unstable leaf nodes
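A minimal sketch of steps 1-4 on the table above; it uses plain information gain as the selection measure (one of several options, see the next slide). Note that on this small table information gain happens to choose Headache as the root, whereas the figure splits on Temperature first; different measures can pick different roots.

```python
import math
from collections import Counter

# Training examples from the table: (attribute values, Flu label).
examples = [
    ({"Headache": "yes", "Temperature": "normal"},    "no"),   # e1
    ({"Headache": "yes", "Temperature": "high"},      "yes"),  # e2
    ({"Headache": "yes", "Temperature": "very high"}, "yes"),  # e3
    ({"Headache": "no",  "Temperature": "normal"},    "no"),   # e4
    ({"Headache": "no",  "Temperature": "high"},      "no"),   # e5
    ({"Headache": "no",  "Temperature": "very high"}, "no"),   # e6
]

def entropy(rows):
    counts = Counter(label for _, label in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def build_tree(rows, attributes):
    labels = {label for _, label in rows}
    if len(labels) == 1 or not attributes:        # step 4: unambiguous -> stop
        return Counter(l for _, l in rows).most_common(1)[0][0]
    def gain(a):                                  # step 1: selection measure
        values = Counter(case[a] for case, _ in rows)
        rem = sum(n / len(rows) * entropy([r for r in rows if r[0][a] == v])
                  for v, n in values.items())
        return entropy(rows) - rem
    best = max(attributes, key=gain)
    rest = [a for a in attributes if a != best]
    return (best, {v: build_tree([r for r in rows if r[0][best] == v], rest)
                   for v in {case[best] for case, _ in rows}})  # steps 2-3

print(build_tree(examples, ["Headache", "Temperature"]))
# ('Headache', {'yes': ('Temperature', {...}), 'no': 'no'})
```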
17
Some Attribute Selection Measures

Gain ratio (Quinlan, C4.5, 1993):
$\dfrac{-\sum_j p_{.j}\log_2 p_{.j} + \sum_i \sum_j p_{ij}\log_2(p_{ij}/p_{i.})}{-\sum_i p_{i.}\log_2 p_{i.}}$

Gini index (Breiman, CART, 1984):
$\sum_i \sum_j p_{ij}^2 / p_{i.} - \sum_j p_{.j}^2$

$\chi^2$ statistic:
$\sum_i \sum_j (n_{ij} - e_{ij})^2 / e_{ij}$, where $e_{ij} = n_{i.}\, n_{.j} / n$

R-measure (Nguyen & Ho, CABRO, 1996):
$\sum_i \max_j \{ p_{ij} \}$
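All four measures can be computed from a single contingency table n[i][j] (rows i = attribute values, columns j = classes). A hand-rolled sketch, not the course software:

```python
import math

def measures(n):
    """n[i][j]: count of examples with attribute value i and class j."""
    N = sum(map(sum, n))
    p = [[nij / N for nij in row] for row in n]          # joint p_ij
    pi = [sum(row) for row in p]                         # marginal p_i.
    pj = [sum(col) for col in zip(*p)]                   # marginal p_.j

    def h(ps):                                           # entropy, 0-safe
        return -sum(q * math.log2(q) for q in ps if q > 0)

    gain = h(pj) - sum(pi[i] * h([pij / pi[i] for pij in p[i]])
                       for i in range(len(p)) if pi[i] > 0)
    gain_ratio = gain / h(pi)                            # C4.5
    gini = sum(p[i][j] ** 2 / pi[i]
               for i in range(len(p)) for j in range(len(p[i]))
               if pi[i] > 0) - sum(q ** 2 for q in pj)   # CART
    chi2 = sum((n[i][j] - N * pi[i] * pj[j]) ** 2 / (N * pi[i] * pj[j])
               for i in range(len(n)) for j in range(len(n[i]))
               if pi[i] > 0 and pj[j] > 0)               # chi-square
    r = sum(max(row) for row in p)                       # R-measure (CABRO)
    return gain_ratio, gini, chi2, r

# Headache from the flu table: rows {yes, no}, columns {Flu=no, Flu=yes}.
print(measures([[1, 2], [3, 0]]))
```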
18
Avoiding Overfitting
How can we avoid overfitting?
• Stop growing when data split not statistically significant (pre-pruning)
• Grow full tree, then post-prune (post-pruning)
Two post-pruning techniques (a sketch of the first follows):
• Reduced-Error Pruning
• Rule Post-Pruning
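A minimal sketch of reduced-error pruning on a dict-encoded tree; the encoding, tree, and validation set are invented for illustration:

```python
from collections import Counter

# A tree is a class label (leaf) or (attribute, {value: subtree}).
def classify(tree, case):
    while not isinstance(tree, str):
        attr, branches = tree
        tree = branches[case[attr]]
    return tree

def accuracy(tree, validation):
    return sum(classify(tree, c) == y for c, y in validation) / len(validation)

def prune(tree, validation):
    """Bottom-up: replace a subtree by its majority leaf whenever
    accuracy on the validation set does not decrease."""
    if isinstance(tree, str) or not validation:
        return tree
    attr, branches = tree
    kept = (attr, {v: prune(sub, [x for x in validation if x[0][attr] == v])
                   for v, sub in branches.items()})
    leaf = Counter(y for _, y in validation).most_common(1)[0][0]
    return leaf if accuracy(leaf, validation) >= accuracy(kept, validation) else kept

tree = ("Headache", {"yes": ("Temperature", {"normal": "no", "high": "yes",
                                             "very high": "yes"}),
                     "no": "no"})
validation = [({"Headache": "yes", "Temperature": "high"}, "no"),
              ({"Headache": "no",  "Temperature": "normal"}, "no")]
print(prune(tree, validation))   # the noisy branch collapses to the leaf "no"
```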
19
Converting a Tree to Rules

[Figure: the PlayTennis decision tree]
Outlook?
• sunny: Humidity? (high: no; normal: yes)
• o’cast: yes
• rain: Wind? (true: no; false: yes)

IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) AND (Humidity = Normal) THEN PlayTennis = Yes
...
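The conversion itself is a walk over root-to-leaf paths, one rule per path; a small sketch with the figure's tree hard-coded in a dict encoding:

```python
# (attribute, {value: subtree}) for decision nodes, a class label for leaves.
tree = ("Outlook", {
    "sunny":  ("Humidity", {"high": "No", "normal": "Yes"}),
    "o'cast": "Yes",
    "rain":   ("Wind", {"true": "No", "false": "Yes"}),
})

def to_rules(node, conditions=()):
    if isinstance(node, str):                     # leaf: one rule per path
        body = " AND ".join(f"({a} = {v})" for a, v in conditions)
        yield f"IF {body} THEN PlayTennis = {node}"
        return
    attr, branches = node
    for value, subtree in branches.items():
        yield from to_rules(subtree, conditions + ((attr, value),))

for rule in to_rules(tree):
    print(rule)
# IF (Outlook = sunny) AND (Humidity = high) THEN PlayTennis = No
# IF (Outlook = sunny) AND (Humidity = normal) THEN PlayTennis = Yes
# ...
```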
20
CABRO: Decision Tree Induction
CABRO is based on the R-measure, a measure of attribute dependency stemming from rough set theory.
[Figure: an unknown case is classified by matching a path in the discovered decision tree; the matching path gives the class of the unknown instance.]
21
Brief introduction to lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
22
Lecture 4: Mining Association Rules
•A new technique and attractive topic: it allows discovering the important associations among items of transactions in a database.
•Content of the lecture
1. Basic Definitions
2. Algorithm Apriori
3. The Basket Analysis Program
•Prerequisite: Nothing special
23
Measures of Association Rules
There are two measures of rule strength:
Support of (A => B) = [AB] / N, where [AB] is the number of records containing both A and B, and N is the number of records in the database. The support of a rule is the proportion of records to which the rule applies.
Confidence of (A => B) = [AB] / [A], where [A] is the number of records containing A. The confidence of a rule is the proportion of times the rule is correct.
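Both measures in code, over transactions represented as frozensets (a toy sketch, using the small database from the next slide):

```python
def support(itemset, transactions):
    """Proportion of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Proportion of transactions containing `lhs` that also contain `rhs`."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

transactions = [frozenset("ABC"), frozenset("AC"),
                frozenset("AD"), frozenset("BEF")]
print(support(frozenset("AC"), transactions))                    # 0.5
print(confidence(frozenset("A"), frozenset("C"), transactions))  # 0.666...
```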
24
Algorithm Apriori
• The task of mining association rules is mainly to discover strong association rules (high confidence and strong support) in large databases.
• Mining association rules is composed of two steps:
1. Discover the large itemsets, i.e., the sets of items that have transaction support above a predetermined minimum support s.
2. Use the large itemsets to generate the association rules.
Example (s = 40%):
TID   Items
1000  A, B, C
2000  A, C
3000  A, D
4000  B, E, F
Large-support items: {A} 75%, {B} 50%, {C} 50%, {A, C} 50%
25
Algorithm Apriori: Illustration (minimum support S = 40%, i.e., 2 of 4 transactions)

Database D:
TID  Items
100  A, C, D
200  B, C, E
300  A, B, C, E
400  B, E

Scan D -> C1 (candidate 1-itemsets with counts): {A} 2, {B} 3, {C} 3, {D} 1, {E} 3
L1 (large 1-itemsets): {A} 2, {B} 3, {C} 3, {E} 3

C2 (candidates from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
Scan D -> counts: {A,B} 1, {A,C} 2, {A,E} 1, {B,C} 2, {B,E} 3, {C,E} 2
L2: {A,C} 2, {B,C} 2, {B,E} 3, {C,E} 2

C3 (candidates from L2): {B,C,E}
Scan D -> {B,C,E} 2
L3: {B,C,E} 2
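A compact sketch of the loop illustrated above (straightforward join for candidate generation, without the subset-based pruning of the full algorithm):

```python
from itertools import combinations

def count(itemset, transactions):
    return sum(itemset <= t for t in transactions)

def apriori(transactions, min_support):
    """Return all large itemsets (support >= min_support, a fraction)."""
    min_count = min_support * len(transactions)
    items = sorted({i for t in transactions for i in t})
    # C1 -> L1: scan D once for the single items.
    level = [frozenset([i]) for i in items
             if count(frozenset([i]), transactions) >= min_count]
    large = list(level)
    k = 2
    while level:
        # Ck: join L(k-1) with itself, keeping only size-k unions.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Scan D and keep the candidates that meet the minimum count.
        level = [c for c in candidates if count(c, transactions) >= min_count]
        large += level
        k += 1
    return large

D = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
for itemset in apriori(D, 0.4):       # S = 40%, i.e. 2 of 4 transactions
    print(sorted(itemset))
# Prints (in some order) the large itemsets from the illustration:
# {A} {B} {C} {E} {A,C} {B,C} {B,E} {C,E} {B,C,E}
```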