1
Knowledge Discovery
Transparencies prepared by Ho Tu Bao (JAIST), ITCS 6162
2
Clustering
“Are there clusters of similar cells?”
• Light color with 1 nucleus
• Dark color with 2 tails and 2 nuclei
• 1 nucleus and 1 tail
• Dark color with 1 tail and 2 nuclei
3
Association Rule Discovery
Task: discovering association rules among items in a transaction database.
An association between two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B.
In general: A1, A2, … => B
4
Association Rule Discovery
“Are there any associations between the characteristics of the cells?”
• If Color = light and #nuclei = 1 then #tails = 1 (support = 12.5%; confidence = 50%)
• If #nuclei = 2 and Cell = Cancerous then #tails = 2 (support = 25%; confidence = 100%)
• If #tails = 1 then Color = light (support = 37.5%; confidence = 75%)
5
Many Other Data Mining Techniques
• Genetic Algorithms
• Statistics
• Bayesian Networks
• Rough Sets
• Time Series
• Text Mining
6
Lecture 1: Overview of KDD
1. What is KDD and Why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
7
Challenges and Influential Aspects
• Massive data sets, high dimensionality (efficiency, scalability)
• Different sources of data (distributed, heterogeneous databases; noisy, missing, and irrelevant data, etc.)
• Handling of different types of data with different degrees of supervision
• Changing data and knowledge
• Understandability of patterns; various kinds of requests and results (decision lists, inference networks, concept hierarchies, etc.)
• Interaction and visualization
[Figure: these aspects arranged around “Knowledge Discovery” at the center.]
8
Massive Data Sets and High Dimensionality
High dimensionality increases the size of the space of patterns exponentially: with p attributes, each having on average d discrete values, the space of possible instances has size $d^p$.
Example: Classes: {Cancerous, Healthy}; Attributes: Cell body: {dark, light}, #nuclei: {1, 2}, #tails: {1, 2} (# instances = 2^3 = 8).
[Figure: the 8-instance space, with cancerous cells C1, C2, C3 and healthy cells h1, h2 marked.]
With 38 attributes, each with 10 values on average: # instances = 10^38.
# attributes?
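To make the growth concrete, a minimal sketch in plain Python (the function name is ours, purely illustrative):

```python
# Size of the instance space for p attributes with about d values each: d**p.
def instance_space_size(d: int, p: int) -> int:
    return d ** p

# The cell example: 3 binary attributes -> 2**3 = 8 possible instances.
print(instance_space_size(2, 3))     # 8

# 38 attributes with 10 values each -> 10**38 instances: no database
# can come close to covering this space.
print(instance_space_size(10, 38))   # 10**38
```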
9
Different Types of Data
• Nominal (categorical), no structure: Places, Color
• Ordinal, ordinal structure: Rank, Resemblance
• Measurable (numerical), ring structure: Age, Temperature, Taste, Income, Length
Symbolic attributes call for combinatorial search in hypothesis spaces (machine learning); numerical attributes often allow matrix-based computation (multivariate data analysis).
10
Brief introduction to lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
11
Lecture 2: Preparing Data
•As much as 80% of KDD is about preparing data; the remaining 20% is about mining.
•Content of the lecture (a small sketch follows):
1. Data cleaning
2. Data transformations
3. Data reduction
4. Software and case studies
•Prerequisite: nothing special, but some understanding of statistics is expected.
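As a small illustration of steps 1-3, a hedged sketch using pandas; the column names, thresholds, and cleaning rules are invented for the example, not taken from the lecture:

```python
import pandas as pd

# Toy records with typical problems: a missing value and an entry error.
df = pd.DataFrame({
    "age":    [25, None, 47, 190],          # 190 is out of range
    "income": [30000, 45000, None, 52000],
})

# 1. Data cleaning: treat impossible values as missing, then impute.
df.loc[df["age"] > 120, "age"] = None
df = df.fillna(df.median(numeric_only=True))

# 2. Data transformation: min-max scaling of each column to [0, 1].
scaled = (df - df.min()) / (df.max() - df.min())

# 3. Data reduction: keep only columns with enough variance (toy criterion).
reduced = scaled.loc[:, scaled.std() > 0.1]
print(reduced)
```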
12
Data Preparation
The design and organization of data, including the setting of goals and the composition of features, are done by humans. There are two central goals for the preparation of data:
• To organize data into a standard form that is ready for processing by data mining programs.
• To prepare features that lead to the best data mining performance.
13
Brief introduction to lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
14
Lecture 3: Decision Tree Induction
•One of the most widely used KDD classification techniques for supervised data.
•Content of the lecture
1. Decision tree representation and framework
2. Attribute selection
3. Pruning trees
4. Software C4.5, CABRO and case studies
•Prerequisite: Nothing special
15
Decision Trees
A decision tree is a classifier in the form of a tree structure, where each node is either:
• a leaf node, indicating a class of instances, or
• a decision node, specifying some test to be carried out on a single attribute value, with one branch and subtree for each possible outcome of the test.
A decision tree can be used to classify an instance by starting at the root of the tree and moving through it until a leaf is met.
16
General Framework of Decision Tree Induction
1. Choose the “best” attribute by a given selection measure
2. Extend the tree by adding a new branch for each attribute value
3. Sort the training examples to the leaf nodes
4. If examples are unambiguously classified, then stop; else repeat steps 1-4 for the leaf nodes
(A minimal code sketch follows the example below.)
    Headache  Temperature  Flu
e1  yes       normal       no
e2  yes       high         yes
e3  yes       very high    yes
e4  no        normal       no
e5  no        high         no
e6  no        very high    no

[Figure: a decision tree for this table]
Temperature?
• normal: no {e1, e4}
• high: Headache? {e2, e5} (yes: yes {e2}; no: no {e5})
• very high: Headache? {e3, e6} (yes: yes {e3}; no: no {e6})
5. Prune unstable leaf nodes
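A minimal sketch of steps 1-4 on the table above; it uses plain information gain as the selection measure (one of several options, see the next slide). Note that on this small table information gain happens to choose Headache as the root, whereas the figure splits on Temperature first; different measures can pick different roots.

```python
import math
from collections import Counter

# Training examples from the table: (attribute values, Flu label).
examples = [
    ({"Headache": "yes", "Temperature": "normal"},    "no"),   # e1
    ({"Headache": "yes", "Temperature": "high"},      "yes"),  # e2
    ({"Headache": "yes", "Temperature": "very high"}, "yes"),  # e3
    ({"Headache": "no",  "Temperature": "normal"},    "no"),   # e4
    ({"Headache": "no",  "Temperature": "high"},      "no"),   # e5
    ({"Headache": "no",  "Temperature": "very high"}, "no"),   # e6
]

def entropy(rows):
    counts = Counter(label for _, label in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def build_tree(rows, attributes):
    labels = {label for _, label in rows}
    if len(labels) == 1 or not attributes:        # step 4: unambiguous -> stop
        return Counter(l for _, l in rows).most_common(1)[0][0]
    def gain(a):                                  # step 1: selection measure
        values = Counter(case[a] for case, _ in rows)
        rem = sum(n / len(rows) * entropy([r for r in rows if r[0][a] == v])
                  for v, n in values.items())
        return entropy(rows) - rem
    best = max(attributes, key=gain)
    rest = [a for a in attributes if a != best]
    return (best, {v: build_tree([r for r in rows if r[0][best] == v], rest)
                   for v in {case[best] for case, _ in rows}})  # steps 2-3

print(build_tree(examples, ["Headache", "Temperature"]))
# ('Headache', {'yes': ('Temperature', {...}), 'no': 'no'})
```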
17
Some Attribute Selection Measures

Gain ratio (Quinlan, C4.5, 1993):
$\dfrac{-\sum_j p_{.j}\log_2 p_{.j} + \sum_i \sum_j p_{ij}\log_2(p_{ij}/p_{i.})}{-\sum_i p_{i.}\log_2 p_{i.}}$

Gini index (Breiman, CART, 1984):
$\sum_i \sum_j p_{ij}^2 / p_{i.} - \sum_j p_{.j}^2$

$\chi^2$ statistic:
$\sum_i \sum_j (n_{ij} - e_{ij})^2 / e_{ij}$, where $e_{ij} = n_{i.}\, n_{.j} / n$

R-measure (Nguyen & Ho, CABRO, 1996):
$\sum_i \max_j \{ p_{ij} \}$
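All four measures can be computed from a single contingency table n[i][j] (rows i = attribute values, columns j = classes). A hand-rolled sketch, not the course software:

```python
import math

def measures(n):
    """n[i][j]: count of examples with attribute value i and class j."""
    N = sum(map(sum, n))
    p = [[nij / N for nij in row] for row in n]          # joint p_ij
    pi = [sum(row) for row in p]                         # marginal p_i.
    pj = [sum(col) for col in zip(*p)]                   # marginal p_.j

    def h(ps):                                           # entropy, 0-safe
        return -sum(q * math.log2(q) for q in ps if q > 0)

    gain = h(pj) - sum(pi[i] * h([pij / pi[i] for pij in p[i]])
                       for i in range(len(p)) if pi[i] > 0)
    gain_ratio = gain / h(pi)                            # C4.5
    gini = sum(p[i][j] ** 2 / pi[i]
               for i in range(len(p)) for j in range(len(p[i]))
               if pi[i] > 0) - sum(q ** 2 for q in pj)   # CART
    chi2 = sum((n[i][j] - N * pi[i] * pj[j]) ** 2 / (N * pi[i] * pj[j])
               for i in range(len(n)) for j in range(len(n[i]))
               if pi[i] > 0 and pj[j] > 0)               # chi-square
    r = sum(max(row) for row in p)                       # R-measure (CABRO)
    return gain_ratio, gini, chi2, r

# Headache from the flu table: rows {yes, no}, columns {Flu=no, Flu=yes}.
print(measures([[1, 2], [3, 0]]))
```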
18
Avoiding Overfitting
How can we avoid overfitting?
• Stop growing when data split not statistically significant (pre-pruning)
• Grow full tree, then post-prune (post-pruning)
Two post-pruning techniques (a sketch of the first follows):
• Reduced-Error Pruning
• Rule Post-Pruning
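A minimal sketch of reduced-error pruning on a dict-encoded tree; the encoding, tree, and validation set are invented for illustration:

```python
from collections import Counter

# A tree is a class label (leaf) or (attribute, {value: subtree}).
def classify(tree, case):
    while not isinstance(tree, str):
        attr, branches = tree
        tree = branches[case[attr]]
    return tree

def accuracy(tree, validation):
    return sum(classify(tree, c) == y for c, y in validation) / len(validation)

def prune(tree, validation):
    """Bottom-up: replace a subtree by its majority leaf whenever
    accuracy on the validation set does not decrease."""
    if isinstance(tree, str) or not validation:
        return tree
    attr, branches = tree
    kept = (attr, {v: prune(sub, [x for x in validation if x[0][attr] == v])
                   for v, sub in branches.items()})
    leaf = Counter(y for _, y in validation).most_common(1)[0][0]
    return leaf if accuracy(leaf, validation) >= accuracy(kept, validation) else kept

tree = ("Headache", {"yes": ("Temperature", {"normal": "no", "high": "yes",
                                             "very high": "yes"}),
                     "no": "no"})
validation = [({"Headache": "yes", "Temperature": "high"}, "no"),
              ({"Headache": "no",  "Temperature": "normal"}, "no")]
print(prune(tree, validation))   # the noisy branch collapses to the leaf "no"
```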
19
Converting a Tree to Rules

[Figure: the PlayTennis decision tree]
Outlook?
• sunny: Humidity? (high: no; normal: yes)
• o’cast: yes
• rain: Wind? (true: no; false: yes)

IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) AND (Humidity = Normal) THEN PlayTennis = Yes
...
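The conversion itself is a walk over root-to-leaf paths, one rule per path; a small sketch with the figure's tree hard-coded in a dict encoding:

```python
# (attribute, {value: subtree}) for decision nodes, a class label for leaves.
tree = ("Outlook", {
    "sunny":  ("Humidity", {"high": "No", "normal": "Yes"}),
    "o'cast": "Yes",
    "rain":   ("Wind", {"true": "No", "false": "Yes"}),
})

def to_rules(node, conditions=()):
    if isinstance(node, str):                     # leaf: one rule per path
        body = " AND ".join(f"({a} = {v})" for a, v in conditions)
        yield f"IF {body} THEN PlayTennis = {node}"
        return
    attr, branches = node
    for value, subtree in branches.items():
        yield from to_rules(subtree, conditions + ((attr, value),))

for rule in to_rules(tree):
    print(rule)
# IF (Outlook = sunny) AND (Humidity = high) THEN PlayTennis = No
# IF (Outlook = sunny) AND (Humidity = normal) THEN PlayTennis = Yes
# ...
```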
20
CABRO: Decision Tree Induction
CABRO is based on the R-measure, a measure of attribute dependency stemming from rough set theory.
[Figure: an unknown case is classified by matching a path in the discovered decision tree; the matching path gives the class of the unknown instance.]
21
Brief introduction to lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
22
Lecture 4: Mining Association Rules
•A new technique and attractive topic: it allows discovering the important associations among items of transactions in a database.
•Content of the lecture
1. Basic Definitions
2. Algorithm Apriori
3. The Basket Analysis Program
•Prerequisite: Nothing special
23
Measures of Association Rules
There are two measures of rule strength:
Support of (A => B) = [AB] / N, where [AB] is the number of records containing both A and B, and N is the number of records in the database. The support of a rule is the proportion of records to which the rule applies.
Confidence of (A => B) = [AB] / [A], where [A] is the number of records containing A. The confidence of a rule is the proportion of times the rule is correct.
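Both measures in code, over transactions represented as frozensets (a toy sketch, using the small database from the next slide):

```python
def support(itemset, transactions):
    """Proportion of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Proportion of transactions containing `lhs` that also contain `rhs`."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

transactions = [frozenset("ABC"), frozenset("AC"),
                frozenset("AD"), frozenset("BEF")]
print(support(frozenset("AC"), transactions))                    # 0.5
print(confidence(frozenset("A"), frozenset("C"), transactions))  # 0.666...
```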
24
Algorithm Apriori
• The task of mining association rules is mainly to discover strong association rules (high confidence and strong support) in large databases.
• Mining association rules is composed of two steps:
1. Discover the large itemsets, i.e., the sets of items that have transaction support above a predetermined minimum support s.
2. Use the large itemsets to generate the association rules.
Example (s = 40%):
TID   Items
1000  A, B, C
2000  A, C
3000  A, D
4000  B, E, F
Large-support items: {A} 75%, {B} 50%, {C} 50%, {A, C} 50%
25
Algorithm Apriori: Illustration (minimum support S = 40%, i.e., 2 of 4 transactions)

Database D:
TID  Items
100  A, C, D
200  B, C, E
300  A, B, C, E
400  B, E

Scan D -> C1 (candidate 1-itemsets with counts): {A} 2, {B} 3, {C} 3, {D} 1, {E} 3
L1 (large 1-itemsets): {A} 2, {B} 3, {C} 3, {E} 3

C2 (candidates from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
Scan D -> counts: {A,B} 1, {A,C} 2, {A,E} 1, {B,C} 2, {B,E} 3, {C,E} 2
L2: {A,C} 2, {B,C} 2, {B,E} 3, {C,E} 2

C3 (candidates from L2): {B,C,E}
Scan D -> {B,C,E} 2
L3: {B,C,E} 2
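A compact sketch of the loop illustrated above (straightforward join for candidate generation, without the subset-based pruning of the full algorithm):

```python
from itertools import combinations

def count(itemset, transactions):
    return sum(itemset <= t for t in transactions)

def apriori(transactions, min_support):
    """Return all large itemsets (support >= min_support, a fraction)."""
    min_count = min_support * len(transactions)
    items = sorted({i for t in transactions for i in t})
    # C1 -> L1: scan D once for the single items.
    level = [frozenset([i]) for i in items
             if count(frozenset([i]), transactions) >= min_count]
    large = list(level)
    k = 2
    while level:
        # Ck: join L(k-1) with itself, keeping only size-k unions.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Scan D and keep the candidates that meet the minimum count.
        level = [c for c in candidates if count(c, transactions) >= min_count]
        large += level
        k += 1
    return large

D = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
for itemset in apriori(D, 0.4):       # S = 40%, i.e. 2 of 4 transactions
    print(sorted(itemset))
# Prints (in some order) the large itemsets from the illustration:
# {A} {B} {C} {E} {A,C} {B,C} {B,E} {C,E} {B,C,E}
```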