Fall 2004 Data Mining: IE 483/583 Knowledge Discovery and Data Mining, Dr. Siggi Olafsson, Fall 2003

Post on 22-Dec-2015


Fall 2004 Data Mining 1

IE 483/583 Knowledge Discovery and Data Mining

Dr. Siggi Olafsson

Fall 2003


What is Data Mining?

(… and should I be here?)


Dilbert Replies ...

Some Definitions

“Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.”

“Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules.”

What can Data Mining Do?

• Supervised
– Classification
– Prediction
• Unsupervised
– Association discovery
– Clustering


Applications of Data Mining

• Manufacturing Process Improvement

• Sales and Marketing

• Mapping the Human Genome

• Diagnosing Breast Cancer

• Financial Crime Identification

• Portfolio Management

Technical Background

• Machine Learning
– Data mining: business-oriented use of AI
• Statistics
– Regression, sampling, DOE, etc.
• Decision Support
– Data warehousing, data marts, OLAP, etc.
• Interdisciplinary tools put together to form the process of knowledge discovery in databases …

Historical Perspective

Period  Field  Developments
< 40    Stat   Bayes theorem, regression, etc.
40s     AI     Neural networks
50s     AI     Nearest neighbor, single link, perceptron
        Stat   Resampling, bias reduction, jackknife
60s     Stat   Linear models for classification, exploratory data analysis (EDA)
        IR     Similarity measures, clustering
        DB     Relational data model
70s     IR     Smart IR systems
        AI     Genetic algorithms
        Stat   EM algorithm, k-means clustering
80s     AI     Kohonen maps, decision trees
90s     DB     Association rule algorithms, web & search engines, data warehousing, OLAP


What Changed?

• Very large databases

• Increased computational power as enabler

• Business perspective

Knowledge Discovery in Databases

[Figure: process flow from Databases to Data warehouse to Prepared Data to Model/Structures to Knowledge. Stage labels: Data Warehouse Systems Engineering; Knowledge Discovery and Data Mining.]

Course Information

• We assume data is ready for mining
• Thus, we focus on:
– models and structures, and
– algorithms

• More information on course homepage

http://www.public.iastate.edu/~olafsson/mining.html

Course Outline

• Introduction
• Exploratory Data Mining
• Supervised Learning
• Unsupervised Learning
• Optimization Methods in Learning
• Selected Advanced Topics
– Mining the Web
– Customer Relationship Management (CRM)
• Course Review


Questions?

Data Mining

• Discover patterns in data
– automatic or semi-automatic process
– meaningful or useful pattern
– large amounts of data
• What does such a pattern look like?
– Black box
– Transparent box

Describing Structural Patterns

• Some ways of representing knowledge:
– Decision tables
– Decision trees
– Classification rules
– Association rules
– Regression trees
– Clusters

The Weather Problem

Outlook   Temp.  Humidity  Windy  Play
Sunny     Hot    High      FALSE  No
Sunny     Hot    High      TRUE   No
Overcast  Hot    High      FALSE  Yes
Rainy     Mild   High      FALSE  Yes
Rainy     Cool   Normal    FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Mild   High      FALSE  No
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   Normal    FALSE  Yes
Sunny     Mild   Normal    TRUE   Yes
Overcast  Mild   High      TRUE   Yes
Overcast  Hot    Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No


A Decision List

If outlook = sunny and humidity = high then play = no

If outlook = rainy and windy = true then play = no

If outlook = overcast then play = yes

If humidity = normal then play = yes

If none of the above then play = yes

• These are classification rules
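
A decision list is applied top to bottom, and the first rule that matches decides the class. The sketch below is one possible Python rendering of the list above, checked against the 14 weather instances. The function and variable names are illustrative assumptions, not part of the course material.

```python
# Weather data from the slides: (outlook, temp, humidity, windy, play)
WEATHER = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def classify(outlook, humidity, windy):
    """Apply the decision list in order; the first matching rule wins."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # none of the above


correct = sum(
    classify(outlook, humidity, windy) == play
    for outlook, _temp, humidity, windy, play in WEATHER
)
print(f"{correct} of {len(WEATHER)} instances classified correctly")
```

The order matters: the rule "humidity = normal then yes" would misclassify the rainy, windy instances if it were tested before the second rule.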


Association Rules

• Many association rules can be inferred:

if temperature = cool then humidity = normal

if humidity = normal and windy = false then play = yes

if outlook = sunny and play = no then humidity = high


Three Layers of the Process

Inputs

Outputs

Algorithms

Inputs

• Three forms
– Concepts
  • concept description - what you want to learn
– Instances
  • examples - what you learn from
– Attributes
  • features of instances - variables you have values for


Concepts: Styles of Learning

• Classification (supervised) learning

• Association learning

• Clustering

• Numeric prediction

Instances: Learn from Examples

• Set of instances to be classified, or associated, or clustered
• Example of concept to be learned
• Data set: flat file (single relation)
– denormalization
• Family tree example
– concept: sister
– example: family tree

Family Tree

[Figure: family tree. Peter (M) = Peggy (F), with children Steven (M), Graham (M), and Pam (F). Grace (F) = Ray (M), with children Ian (M), Pippa (F), and Brian (M). Pam (F) = Ian (M), with children Anna (F) and Nikki (F).]

Denormalizing Relational Data

Name    Gender  Parent1  Parent2  Name   Gender  Parent1  Parent2  Sister of?
Steven  Male    Peter    Peggy    Pam    Female  Peter    Peggy    Yes
Ian     Male    Grace    Ray      Pippa  Female  Grace    Ray      Yes
Brian   Male    Grace    Ray      Pippa  Female  Grace    Ray      Yes
Anna    Female  Pam      Ian      Nikki  Female  Pam      Ian      Yes
Nikki   Female  Pam      Ian      Anna   Female  Pam      Ian      Yes
All others                                                         No
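
Denormalization here amounts to joining the person relation with itself, so that each flat row describes an ordered pair of people plus the class. The following Python sketch illustrates the idea. It is restricted, as an assumption, to the seven people whose parents appear in the table, and the names `PEOPLE` and `denormalize` are hypothetical.

```python
from itertools import permutations

# name -> (gender, parent1, parent2), restricted to the people whose
# parents are listed in the table (an assumption of this sketch).
PEOPLE = {
    "Steven": ("Male", "Peter", "Peggy"),
    "Pam": ("Female", "Peter", "Peggy"),
    "Ian": ("Male", "Grace", "Ray"),
    "Pippa": ("Female", "Grace", "Ray"),
    "Brian": ("Male", "Grace", "Ray"),
    "Anna": ("Female", "Pam", "Ian"),
    "Nikki": ("Female", "Pam", "Ian"),
}


def denormalize(people):
    """Join the relation with itself: one flat row per ordered pair."""
    rows = []
    for first, second in permutations(people, 2):
        g1, p1a, p1b = people[first]
        g2, p2a, p2b = people[second]
        # The second person is a sister of the first if she is female
        # and both share the same two parents.
        is_sister = g2 == "Female" and (p1a, p1b) == (p2a, p2b)
        rows.append((first, g1, p1a, p1b, second, g2, p2a, p2b,
                     "Yes" if is_sister else "No"))
    return rows


rows = denormalize(PEOPLE)
positives = [(r[0], r[4]) for r in rows if r[8] == "Yes"]
print(len(rows), "rows,", len(positives), "positive")
for first, second in positives:
    print(f"{second} is a sister of {first}")
```

Seven people already yield 42 rows for 5 positive examples, which hints at how quickly the flat file grows.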


Denormalization Problems

• Computational and storage costs

• Trivial regularities
– customers/products, product/supplier, supplier/supplier address

• Infinite relations

Content of Instances: Attributes

• Instance characterized by values of its (predefined) set of attributes
– Numeric (“continuous”)
– Nominal (categorical)
– Ordinal (rank)
– Interval
– Ratio

Focus in this class

Data Preparation

• Data …
– assembly
  • set of instances/denormalizing relational data
– integration
  • enterprise-wide database/data warehouse
– cleaning
  • missing data
– aggregation
  • good information

ARFF Format

• Used by the Java package Weka

• Independent, unordered instances

• No relationship between instances

Weather Data

% ARFF file for the weather data with some numeric features
%
@relation weather

@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }

@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no

Features

• % = comments
• @relation <name>
• @attribute <name> <type>
– Attribute types: Nominal and numeric
• @data
– List of instances
– Missing values represented by ?
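
To make these features concrete, here is a minimal Python sketch of a reader for the subset of ARFF listed above: comments, @relation, nominal and numeric @attribute declarations, and @data with ? for missing values. It is not Weka's parser, and it ignores parts of the format not covered on the slide (for example quoted values). The name `parse_arff` and the shortened sample file are assumptions for illustration.

```python
def parse_arff(text):
    """Parse the ARFF subset described above.

    Returns (relation, attributes, instances); attributes is a list of
    (name, type), where type is "numeric" or a list of nominal values.
    """
    relation, attributes, instances = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank line or comment
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for (_name, kind), value in zip(attributes, values):
                if value == "?":
                    row.append(None)  # missing value
                elif kind == "numeric":
                    row.append(float(value))
                else:
                    row.append(value)
            instances.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            else:
                kind = kind.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, instances


# Hypothetical, shortened sample: two weather instances, one value
# replaced by ? to show missing-value handling.
SAMPLE = """\
% shortened weather file
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute play? { yes, no }
@data
sunny, 85, no
overcast, ?, yes
"""

relation, attributes, instances = parse_arff(SAMPLE)
print(relation)
print(attributes)
print(instances)
```

Missing values are mapped to `None` so that later steps can decide how to treat them.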


Other Issues

• Missing data

• Inaccurate values

• Look at the data!!!

Recall the Three Layers of the Data Mining Process

Inputs (done)

Outputs (structural patterns) (next)

Algorithms

Describing Structural Patterns

• Ways of representing knowledge:
– Decision tables
– Decision trees
– Classification rules
– Association rules
– Regression trees
– Clusters


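A Decision Tree

[Figure: decision tree for the weather problem. The root tests Outlook, with branches Sunny, Overcast, and Rainy. Sunny leads to a Humidity test (High: Play=No). Overcast leads to Play=Yes. Rainy leads to a Windy test (TRUE: Play=No).]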


Classification Rules

• Classification rules are easily read off decision trees
• How?
• Other direction possible, but not as straightforward

If a and b then x
If c and d then x
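
One way to answer the "How?" is to emit one rule per leaf, whose conditions are the tests along the path from the root to that leaf. The Python sketch below does this for the weather decision tree. The tree encoding is hypothetical, and the humidity = normal and windy = false leaves are assumptions chosen to agree with the decision list shown earlier.

```python
# Weather decision tree as nested tuples: (attribute, {value: subtree}).
# A plain string is a leaf holding the class. The "normal" and "false"
# leaves are assumptions, chosen to agree with the decision list.
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"true": "no", "false": "yes"}),
})


def tree_to_rules(tree, conditions=()):
    """Return one rule string per leaf, testing every node on its path."""
    if isinstance(tree, str):  # leaf: emit the accumulated conditions
        tests = " and ".join(f"{a} = {v}" for a, v in conditions)
        return [f"If {tests} then play = {tree}"]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules


rules = tree_to_rules(TREE)
for rule in rules:
    print(rule)
```

The rules produced this way are unambiguous and can be applied in any order, but they are often more complex than necessary; going from rules back to a tree is harder because rules need not share a common root attribute.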

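Corresponding Decision Tree

[Figure: decision tree equivalent to the two rules. The root tests a. If a = y, the tree tests b: b = y gives x, and b = n leads to a subtree that tests c and then d, giving x when both are y. If a = n, the same c, d subtree is repeated.]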

Replicated Subtree Problem

If x=1 and y=0 then a
If x=0 and y=1 then a
If x=0 and y=0 then b
If x=1 and y=1 then b

[Figure: decision tree for these rules. The root tests x=1; both branches repeat the test y=1, with leaves a and b.]

Replicated Subtree Problem

If x=1 and y=1 then a
If z=1 and w=1 then a
Otherwise b

x, y, z, w take values 1, 2, 3

Rules with exceptions

If x and y then a EXCEPT if z then b

• Account for new instances
• Exceptions from exceptions, etc.

Association Rules

• Coverage (support): number of instances it predicts correctly
• Accuracy (confidence): coverage divided by number of instances it applies to

If temperature = cool then humidity = normal

• Coverage = 4
• Accuracy = 100%
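
The two measures can be computed by counting: first the instances the rule applies to (antecedent holds), then those it also predicts correctly (consequent holds too). The Python sketch below does this over the nominal weather data. The helper name `coverage_and_accuracy` and the dictionary encoding of rules are assumptions made for illustration.

```python
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
RAW = """\
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
"""
DATA = [dict(zip(COLUMNS, line.split())) for line in RAW.strip().splitlines()]


def coverage_and_accuracy(antecedent, consequent, data):
    """Return (coverage, accuracy) using the definitions on the slide."""
    applies = [row for row in data
               if all(row[a] == v for a, v in antecedent.items())]
    correct = [row for row in applies
               if all(row[a] == v for a, v in consequent.items())]
    accuracy = len(correct) / len(applies) if applies else 0.0
    return len(correct), accuracy


# If temperature = cool then humidity = normal
print(coverage_and_accuracy({"temperature": "cool"}, {"humidity": "normal"}, DATA))
# If outlook = sunny and play = no then humidity = high
print(coverage_and_accuracy({"outlook": "sunny", "play": "no"}, {"humidity": "high"}, DATA))
```

The first call prints `(4, 1.0)`, matching the coverage of 4 and accuracy of 100% stated above; the second prints `(3, 1.0)`.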

Interpretation

If windy = false and play = no then outlook = sunny and humidity = high

If windy = false and play = no then outlook = sunny

If windy = false and play = no then humidity = high

If humidity = high and windy = false and play = no then outlook = sunny

The Shapes Problem

[Figure: blocks of various shapes. Shaded = standing, unshaded = lying.]

Instances

Width  Height  Sides  Class
2      4       4      standing
3      6       4      standing
4      3       4      lying
7      8       3      standing
7      6       3      lying
2      9       4      standing
9      1       4      lying
10     2       3      lying

Classification Rules

If width ≥ 3.5 and height < 7.0 then lying
If height ≥ 3.5 then standing

• Work well to classify these instances
• Problems?

Relational Rules

• Rules comparing attributes to constants are called propositional rules
• Structural patterns?

If width > height then lying
If height > width then standing
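
The difference between the two kinds of rules shows up on instances outside the training data. The Python sketch below applies both rule sets to the eight shapes and then to one made-up block. The propositional thresholds are read as ≥ comparisons (an assumption), and the extra block of width 20 and height 10 is hypothetical.

```python
# Shapes data from the slides: (width, height, sides, class)
SHAPES = [
    (2, 4, 4, "standing"), (3, 6, 4, "standing"), (4, 3, 4, "lying"),
    (7, 8, 3, "standing"), (7, 6, 3, "lying"), (2, 9, 4, "standing"),
    (9, 1, 4, "lying"), (10, 2, 3, "lying"),
]


def propositional(width, height):
    """Rules comparing attributes to constants (>= is an assumption)."""
    if width >= 3.5 and height < 7.0:
        return "lying"
    if height >= 3.5:
        return "standing"
    return None  # no rule fires


def relational(width, height):
    """Rules comparing attributes to each other."""
    if width > height:
        return "lying"
    if height > width:
        return "standing"
    return None  # width == height: no rule fires


prop_correct = sum(propositional(w, h) == c for w, h, _s, c in SHAPES)
rel_correct = sum(relational(w, h) == c for w, h, _s, c in SHAPES)
print(prop_correct, rel_correct)

# Hypothetical new block, wider than tall but larger than any seen so far.
print(propositional(20, 10), relational(20, 10))
```

Both rule sets classify all eight instances correctly, but on the larger block the propositional rules answer "standing" while the relational rules answer "lying", because the constants 3.5 and 7.0 only reflect the sizes seen in the data.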

CPU Performance Example

     Cycle      Main memory (KB)  Cache  Channels      Performance
     time (ns)  min     max              min    max
     MYCT       MMIN    MMAX      CACH   CHMIN  CHMAX  PRP
1    125        256     6000      256    16     128    198
2    29         8000    32000     32     8      32     269
3    29         8000    32000     32     8      32     220
4    29         8000    32000     32     8      32     172
5    29         8000    16000     32     8      16     132
…
207  125        2000    8000      0      2      14     52
208  480        512     8000      32     0      0      67
209  480        1000    4000      0      0      0      45

Numerical Prediction: regression equation

\[
\mathrm{PRP} = -56.1 + 0.049\,\mathrm{MYCT} + 0.015\,\mathrm{MMIN} + 0.006\,\mathrm{MMAX} + 0.630\,\mathrm{CACH} - 0.270\,\mathrm{CHMIN} + 1.46\,\mathrm{CHMAX}
\]
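
As a small worked example, the regression equation can be evaluated for the first machine in the CPU table. The Python sketch below takes the coefficients as 56.1, 0.049, 0.015, 0.006, 0.630, 0.270, and 1.46, with a negative sign on the intercept and on the CHMIN term; those signs are an assumption of this sketch, and `predict_prp` is a hypothetical helper name.

```python
# Coefficients of the linear model. The negative signs on the intercept
# and on CHMIN are assumptions of this sketch.
INTERCEPT = -56.1
COEFFICIENTS = {
    "MYCT": 0.049, "MMIN": 0.015, "MMAX": 0.006,
    "CACH": 0.630, "CHMIN": -0.270, "CHMAX": 1.46,
}


def predict_prp(machine):
    """Evaluate the linear model for one machine (dict of attributes)."""
    return INTERCEPT + sum(c * machine[name] for name, c in COEFFICIENTS.items())


# First row of the CPU performance table (observed PRP: 198).
first = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
         "CACH": 256, "CHMIN": 16, "CHMAX": 128}
prediction = predict_prp(first)
print(f"predicted PRP = {prediction:.1f}, observed PRP = 198")
```

Under these assumptions the prediction is about 333.7 against an observed value of 198, a reminder that a single global linear model can be far off for individual instances.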

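Regression Tree

[Figure: regression tree for the CPU data. The root splits on CHMIN (≤ 7.5 vs. > 7.5); lower nodes split on CACH (≤ 8.5, (8.5, 28], > 28) and on MMAX; one leaf is shown with value 64.6.]

- Accuracy?
- Large and possibly awkward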

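Model Trees

[Figure: model tree for the CPU data. The root splits on CHMIN (≤ 7.5 vs. > 7.5). The CHMIN ≤ 7.5 branch tests CACH (≤ 8.5 leads to a further MMAX test, > 8.5 to LM4). The CHMIN > 7.5 branch tests MMAX (≤ 28000 gives LM5, > 28000 gives LM6). Each leaf holds a linear model, for example LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN.]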

Instance-Based Representation

• Store actual instances
• New instance: algorithm finds “most similar” stored instance
• Features
– What is a similar instance?
– Need to store (all?) instances
– Really a black box method
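
A nearest-neighbor classifier is the simplest instance-based method: keep every instance and label a new one with the class of its closest stored instance. The Python sketch below uses the shapes data and takes "most similar" to mean smallest Euclidean distance over width, height, and sides, which is one possible choice and an assumption here. The two query blocks are hypothetical.

```python
import math

# Shapes data from the slides: (width, height, sides, class)
SHAPES = [
    (2, 4, 4, "standing"), (3, 6, 4, "standing"), (4, 3, 4, "lying"),
    (7, 8, 3, "standing"), (7, 6, 3, "lying"), (2, 9, 4, "standing"),
    (9, 1, 4, "lying"), (10, 2, 3, "lying"),
]


def nearest_neighbor(query, instances):
    """Return the class of the stored instance closest to the query."""
    best = min(instances, key=lambda inst: math.dist(query, inst[:3]))
    return best[3]


# Hypothetical new blocks: (width, height, sides)
first_label = nearest_neighbor((3, 7, 4), SHAPES)
second_label = nearest_neighbor((9, 2, 4), SHAPES)
print(first_label, second_label)
```

Nothing is learned in advance; all the work happens at classification time, and the answer depends entirely on the distance measure, which is why the method gives no explicit description of the pattern.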

Clusters

[Figure: two cluster diagrams over the instances a through k.]


Next: Algorithms
