Fall 2004 Data Mining 1
IE 483/583 Knowledge Discovery and Data Mining
Dr. Siggi Olafsson
Fall 2003
What is Data Mining?
(… and should I be here?)
Dilbert Replies ...
Some Definitions
“Data mining is the extraction of implicit, previously unknown, and potentially
useful information from data.”
“Data mining is the process of exploration and analysis, by automatic or
semiautomatic means, of large quantities of data in order to discover
meaningful patterns and rules.”
What can Data Mining Do?
• Supervised
– Classification
– Prediction
• Unsupervised
– Association discovery
– Clustering
Applications of Data Mining
• Manufacturing Process Improvement
• Sales and Marketing
• Mapping the Human Genome
• Diagnosing Breast Cancer
• Financial Crime Identification
• Portfolio Management
Technical Background
• Machine Learning
– Data mining: business-oriented use of AI
• Statistics
– Regression, sampling, DOE, etc.
• Decision Support
– Data warehousing, data marts, OLAP, etc.
• Interdisciplinary tools put together to form the process of knowledge discovery in databases …
Historical Perspective
Period  Field  Developments
< 40    Stat   Bayes theorem, regression, etc.
40s     AI     Neural networks
50s     AI     Nearest neighbor, single link, perceptron
50s     Stat   Resampling, bias reduction, jackknife
60s     Stat   Linear models for classification, exploratory data analysis (EDA)
60s     IR     Similarity measures, clustering
60s     DB     Relational data model
70s     IR     Smart IR systems
70s     AI     Genetic algorithms
70s     Stat   EM algorithm, k-means clustering
80s     AI     Kohonen maps, decision trees
90s     DB     Association rule algorithms, web & search engines, data warehousing, OLAP
What Changed?
• Very large databases
• Increased computational power as enabler
• Business perspective
Knowledge Discovery in Databases
Databases → Data warehouse → Prepared Data → Model/Structures → Knowledge
[Figure labels: Data Warehouse Systems Engineering; Knowledge Discovery and Data Mining]
Course Information
• We assume data is ready for mining
• Thus, we focus on:
– models and structures, and
– algorithms
• More information on course homepage
http://www.public.iastate.edu/~olafsson/mining.html
Course Outline
• Introduction
• Exploratory Data Mining
• Supervised Learning
• Unsupervised Learning
• Optimization Methods in Learning
• Selected Advanced Topics
– Mining the Web
– Customer Relationship Management (CRM)
• Course Review
Questions?
Data Mining
• Discover patterns in data
– automatic or semi-automatic process
– meaningful or useful pattern
– large amounts of data
• What does such a pattern look like?
– Black box or transparent box
Describing Structural Patterns
• Some ways of representing knowledge:
– Decision tables
– Decision trees
– Classification rules
– Association rules
– Regression trees
– Clusters
The Weather Problem
Outlook   Temp.  Humidity  Windy  Play
Sunny     Hot    High      FALSE  No
Sunny     Hot    High      TRUE   No
Overcast  Hot    High      FALSE  Yes
Rainy     Mild   High      FALSE  Yes
Rainy     Cool   Normal    FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Mild   High      FALSE  No
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   Normal    FALSE  Yes
Sunny     Mild   Normal    TRUE   Yes
Overcast  Mild   High      TRUE   Yes
Overcast  Hot    Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No
A Decision List
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
• These are classification rules
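A decision list is evaluated top to bottom, and the first rule that fires decides the class. The sketch below is one possible Python rendering of the five rules, checked against the 14 weather instances; the function name `classify` and the tuple layout are illustrative choices, not part of the course material.

```python
# Weather data transcribed from the slide: (outlook, temp, humidity, windy, play)
WEATHER = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def classify(outlook, humidity, windy):
    """Apply the rules in order; the first rule that fires decides."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # default rule: none of the above


correct = sum(classify(o, h, w) == play for o, _t, h, w, play in WEATHER)
print(f"{correct} of {len(WEATHER)} instances classified correctly")
```

Because the rules are ordered, swapping two of them can change the prediction for an instance that matches both, which is what distinguishes a decision list from an unordered rule set.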
Association Rules
• Many association rules can be inferred:
if temperature = cool then humidity = normal
if humidity = normal and windy = false then play = yes
if outlook = sunny and play = no then humidity = high
Three Layers of the Process
Inputs
Outputs
Algorithms
Inputs
• Three forms
– Concepts
  • concept description - what you want to learn
– Instances
  • examples - what you learn from
– Attributes
  • features of instances - variables you have values for
Concepts: Styles of Learning
• Classification (supervised) learning
• Association learning
• Clustering
• Numeric prediction
Instances: Learn from Examples
• Set of instances to be classified, or associated, or clustered
• Example of concept to be learned
• Data set: flat file (single relation)
– denormalization
• Family tree example
– concept: sister
– example: family tree
Family Tree
[Figure: family tree]
Peter (M) = Peggy (F): Steven (M), Graham (M), Pam (F)
Grace (F) = Ray (M): Ian (M), Pippa (F), Brian (M)
Pam (F) = Ian (M): Anna (F), Nikki (F)
Denormalizing Relational Data
First person                      Second person                     Sister of?
Name    Gender  Parent1  Parent2  Name    Gender  Parent1  Parent2
Steven  Male    Peter    Peggy    Pam     Female  Peter    Peggy    Yes
Ian     Male    Grace    Ray      Pippa   Female  Grace    Ray      Yes
Brian   Male    Grace    Ray      Pippa   Female  Grace    Ray      Yes
Anna    Female  Pam      Ian      Nikki   Female  Pam      Ian      Yes
Nikki   Female  Pam      Ian      Anna    Female  Pam      Ian      Yes
All others                                                          No
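Denormalizing here means joining the person relation with itself so that every ordered pair of people becomes one flat instance. A minimal sketch follows, assuming only the seven people whose parents appear in the table are included; the `PEOPLE` dictionary and the `denormalize` helper are hypothetical names used for illustration.

```python
from itertools import permutations

# One entry per person: name -> (gender, parent1, parent2), as in the table.
PEOPLE = {
    "Steven": ("Male", "Peter", "Peggy"),
    "Pam": ("Female", "Peter", "Peggy"),
    "Ian": ("Male", "Grace", "Ray"),
    "Pippa": ("Female", "Grace", "Ray"),
    "Brian": ("Male", "Grace", "Ray"),
    "Anna": ("Female", "Pam", "Ian"),
    "Nikki": ("Female", "Pam", "Ian"),
}


def denormalize(people):
    """Join the person relation with itself: one flat row per ordered pair."""
    rows = []
    for first, second in permutations(people, 2):
        g1, p1a, p1b = people[first]
        g2, p2a, p2b = people[second]
        # The second person is a sister of the first if she is female
        # and both share the same parents.
        sister = g2 == "Female" and (p1a, p1b) == (p2a, p2b)
        rows.append((first, g1, p1a, p1b, second, g2, p2a, p2b,
                     "Yes" if sister else "No"))
    return rows


flat = denormalize(PEOPLE)
positives = [(row[0], row[4]) for row in flat if row[-1] == "Yes"]
print(len(flat), "rows,", len(positives), "positive examples")
for first, second in positives:
    print(f"{second} is a sister of {first}")
```

Seven people already produce 42 rows, most of them negative examples, which hints at how quickly the flat file grows as the relation gets larger.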
Denormalization Problems
• Computational and storage costs
• Trivial regularities
– customers → products
– product → supplier
– supplier → supplier address
• Infinite relations
Content of Instances: Attributes
• Instance characterized by values of its (predefined) set of attributes
– Numeric (“continuous”)
– Nominal (categorical)
– Ordinal (rank)
– Interval
– Ratio
Focus in this class
Data Preparation
• Data …
– assembly
  • set of instances/denormalizing relational data
– integration
  • enterprise-wide database/data warehouse
– cleaning
  • missing data
– aggregation
  • good information
ARFF Format
• Used by the Java package Weka
• Independent, unordered instances
• No relationship between instances
Weather Data
% ARFF file for the weather data with some numeric features
%
@relation weather

@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }

@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
Features
• % = comments
• @relation <name>
• @attribute <name> <type>
– Attribute types: Nominal and numeric
• @data
– List of instances
– Missing values represented by ?
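The four ARFF elements listed here (comments, @relation, @attribute, @data) are enough to read a simple file by hand. The sketch below is a deliberately small reader, not Weka's parser: the name `parse_arff` is hypothetical, values are kept as strings, and quoted names, date or string attributes, and sparse data are not handled. The second data row, with a missing temperature, is invented for illustration.

```python
def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, instances)."""
    relation, attributes, instances = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank line or comment
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # "?" marks a missing value; store it as None
            instances.append([None if v == "?" else v for v in values])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):  # nominal: list of allowed values
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, instances


SAMPLE = """\
% a fragment of the weather data
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute play? { yes, no }
@data
sunny, 85, no
overcast, ?, yes
"""

relation, attributes, instances = parse_arff(SAMPLE)
print(relation)
print(attributes)
print(instances)
```

Each instance is parsed on its own line with no reference to its neighbors, which mirrors the ARFF assumption of independent, unordered instances.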
Other Issues
• Missing data
• Inaccurate values
• Look at the data!!!
Recall the Three Layers of the Data Mining Process
• Inputs (done)
• Outputs (structural patterns) (next)
• Algorithms
A Decision Tree
[Figure: decision tree for the weather data. Root: Outlook. Sunny → Humidity (High → Play=No); Overcast → Play=Yes; Rainy → Windy (TRUE → Play=No).]
Classification Rules
• Classification easily read off decision trees
• How?
• Other direction possible, but not as straightforward
If a and b then x
If c and d then x
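Reading rules off a tree amounts to emitting one rule per leaf, with the tests along the path from the root as the rule's conditions. The sketch below illustrates this on a weather tree; the nested-tuple encoding, the name `tree_to_rules`, and the complete set of branches (including humidity = normal and windy = false leading to yes) are assumptions made for the example.

```python
# Decision tree as nested tuples: (attribute, {value: subtree-or-class}).
# The full weather tree is assumed here for illustration.
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"true": "no", "false": "yes"}),
})


def tree_to_rules(node, conditions=()):
    """Yield one (conditions, class) rule for every root-to-leaf path."""
    if isinstance(node, str):  # leaf: the path so far is the antecedent
        yield conditions, node
        return
    attribute, branches = node
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))


RULES = list(tree_to_rules(TREE))
for conditions, label in RULES:
    tests = " and ".join(f"{a} = {v}" for a, v in conditions)
    print(f"If {tests} then play = {label}")
```

The resulting rules are unambiguous and can be applied in any order, but they tend to be more numerous and more specific than a hand-written rule set for the same concept.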
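Corresponding Decision Tree
[Figure: decision tree equivalent to the two rules. Nodes test a, b, c, and d with yes/no (y/n) branches and leaves labeled x; the subtree that tests c and d appears more than once.]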
Replicated Subtree Problem
[Figure: decision tree that tests x=1 at the root and y=1 in both subtrees, with yes/no (y/n) branches and leaves a and b.]
If x=1 and y=0 then a
If x=0 and y=1 then a
If x=0 and y=0 then b
If x=1 and y=1 then b
Replicated Subtree Problem
If x=1 and y=1 then a
If z=1 and w=1 then a
Otherwise b
x,y,z,w take values 1,2,3
Rules with exceptions
If x and y then a EXCEPT if z then b
• Account for new instances
• Exceptions from exceptions, etc.
Association Rules
• Coverage (support): number of instances it predicts correctly
• Accuracy (confidence): coverage divided by number of instances it applies to
If temperature = cool then humidity = normal
• Coverage = 4
• Accuracy = 100%
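Coverage and accuracy can be checked directly against the weather data. The sketch below counts the instances a rule applies to and the ones it predicts correctly; the helper name `evaluate` and the dictionary encoding of rules are illustrative choices, and the second rule evaluated is an extra example that does not appear on the slides.

```python
# Weather data from the slides, one dict per instance.
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = """sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
WEATHER = [dict(zip(COLUMNS, line.split())) for line in ROWS.splitlines()]


def evaluate(antecedent, consequent, data):
    """Return (coverage, accuracy) of the rule 'if antecedent then consequent'."""
    applies = [r for r in data
               if all(r[k] == v for k, v in antecedent.items())]
    correct = [r for r in applies
               if all(r[k] == v for k, v in consequent.items())]
    coverage = len(correct)  # instances the rule predicts correctly
    accuracy = coverage / len(applies) if applies else 0.0
    return coverage, accuracy


# Rule from the slide: coverage 4, accuracy 100%
print(evaluate({"temperature": "cool"}, {"humidity": "normal"}, WEATHER))
# A weaker rule for comparison (not from the slides)
print(evaluate({"windy": "false"}, {"play": "yes"}, WEATHER))
```

The comparison rule applies to eight instances but is right on only six of them, so its coverage is higher while its accuracy drops to 75%.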
Interpretation
If windy = false and play = no then outlook = sunny and humidity = high
If windy = false and play = no then outlook = sunny
If windy = false and play = no then humidity = high
If humidity = high and windy = false and play = no then outlook = sunny
The Shapes Problem
Shaded = standing, unshaded = lying
Instances
Width  Height  Sides  Class
2      4       4      standing
3      6       4      standing
4      3       4      lying
7      8       3      standing
7      6       3      lying
2      9       4      standing
9      1       4      lying
10     2       3      lying
Classification Rules
If width ≥ 3.5 and height < 7.0 then lying
If height ≥ 3.5 then standing
• Work well to classify these instances
• Problems?
Relational Rules
• Rules comparing attributes to constants are called propositional rules
• Structural patterns?
If width > height then lying
If height > width then standing
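Both rule styles fit the eight training shapes, but they generalize differently. The sketch below compares them; the propositional thresholds are written with ≥, which is an assumption about the comparison lost from the slide text, and the extra test instance (width 1, height 2) is hypothetical.

```python
# (width, height, sides, class) for the eight instances on the slide
SHAPES = [
    (2, 4, 4, "standing"), (3, 6, 4, "standing"), (4, 3, 4, "lying"),
    (7, 8, 3, "standing"), (7, 6, 3, "lying"), (2, 9, 4, "standing"),
    (9, 1, 4, "lying"), (10, 2, 3, "lying"),
]


def propositional(width, height):
    """Rules that compare each attribute to a constant (thresholds assumed)."""
    if width >= 3.5 and height < 7.0:
        return "lying"
    if height >= 3.5:
        return "standing"
    return None  # no rule fires


def relational(width, height):
    """Rules that compare attributes with each other."""
    if width > height:
        return "lying"
    if height > width:
        return "standing"
    return None  # width == height is not covered


for name, rule in (("propositional", propositional), ("relational", relational)):
    hits = sum(rule(w, h) == cls for w, h, _sides, cls in SHAPES)
    print(f"{name}: {hits}/{len(SHAPES)} training instances correct")

# A small upright block (hypothetical new instance, not on the slide)
print(propositional(1, 2), relational(1, 2))
```

The propositional rules encode thresholds that happen to separate this sample, so the small upright block falls outside both of them, while the relational rules capture the underlying structure and still label it standing.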
CPU Performance Example
     Cycle      Main memory (KB)  Cache  Channels      Performance
     time (ns)  min     max              min    max
     MYCT       MMIN    MMAX      CACH   CHMIN  CHMAX  PRP
1    125        256     6000      256    16     128    198
2    29         8000    32000     32     8      32     269
3    29         8000    32000     32     8      32     220
4    29         8000    32000     32     8      32     172
5    29         8000    16000     32     8      16     132
…
207  125        2000    8000      0      2      14     52
208  480        512     8000      32     0      0      67
209  480        1000    4000      0      0      0      45
Numerical Prediction: regression equation
PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
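A regression equation predicts PRP as a weighted sum of the six attributes plus a constant. The sketch below applies the slide's coefficients to two rows of the CPU table; the signs (negative intercept, negative CHMIN weight) are an assumption, since only the magnitudes survive in the extracted slide, and `predict_prp` is a hypothetical helper name.

```python
# Coefficient magnitudes are from the slide; the signs are an assumption.
INTERCEPT = -56.1
COEFFICIENTS = {
    "MYCT": 0.049, "MMIN": 0.015, "MMAX": 0.006,
    "CACH": 0.630, "CHMIN": -0.270, "CHMAX": 1.46,
}


def predict_prp(machine):
    """Linear regression prediction of PRP for one machine (dict of attributes)."""
    return INTERCEPT + sum(c * machine[name] for name, c in COEFFICIENTS.items())


# Rows 1 and 209 of the CPU performance table, with their recorded PRP.
ROWS = [
    ({"MYCT": 125, "MMIN": 256, "MMAX": 6000,
      "CACH": 256, "CHMIN": 16, "CHMAX": 128}, 198),
    ({"MYCT": 480, "MMIN": 1000, "MMAX": 4000,
      "CACH": 0, "CHMIN": 0, "CHMAX": 0}, 45),
]
for machine, actual in ROWS:
    print(f"predicted {predict_prp(machine):.1f}, actual {actual}")
```

Under these assumptions the single global equation is far off for both rows, which is one motivation for the tree-based alternatives that fit different models to different regions of the data.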
Regression Tree
[Figure: regression tree for the CPU data. Root: CHMIN (≤ 7.5 → CACH; > 7.5 → MMAX). CACH: ≤ 8.5 → MMAX; (8.5, 28] → 64.6; > 28 → MMAX.]
– Accuracy?
– Large and possibly awkward
Model Trees
[Figure: model tree for the CPU data. Root: CHMIN (≤ 7.5 → CACH; > 7.5 → MMAX). CACH: ≤ 8.5 → MMAX; > 8.5 → LM4. MMAX under CHMIN > 7.5: ≤ 28000 → LM5; > 28000 → LM6.]
LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2: PRP = …
Instance-Based Representation
• Store actual instances
• New instance: algorithm finds “most similar” stored instance
• Features
– What is a similar instance?
– Need to store (all?) instances
– Really a black box method
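Instance-based representation can be sketched as a nearest-neighbor lookup: keep the training instances and classify a new one by copying the class of the closest stored instance. The example below reuses the shapes data and assumes Euclidean distance on width and height as the similarity measure, which is only one of many possible answers to "what is a similar instance?"; the query points are hypothetical.

```python
import math

# Stored instances: ((width, height), class), taken from the shapes problem.
STORED = [
    ((2, 4), "standing"), ((3, 6), "standing"), ((4, 3), "lying"),
    ((7, 8), "standing"), ((7, 6), "lying"), ((2, 9), "standing"),
    ((9, 1), "lying"), ((10, 2), "lying"),
]


def nearest_neighbor(query, stored):
    """Return the class of the stored instance closest to the query."""
    _point, label = min(stored, key=lambda item: math.dist(item[0], query))
    return label


print(nearest_neighbor((8, 2), STORED))  # closest stored instance is (9, 1)
print(nearest_neighbor((3, 8), STORED))  # closest stored instance is (2, 9)
```

No explicit pattern is ever produced: the "model" is the stored data plus the distance function, which is why the method behaves like a black box and why every prediction scans all stored instances.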
Clusters:
[Figure: two diagrams showing the instances a through k grouped into clusters.]
Next: Algorithms