Data Mining Demystified
John Aleshunas, Fall Faculty Institute, October 2006

Page 1: Data Mining Demystified

Data Mining Demystified

John Aleshunas

Fall Faculty Institute

October 2006

Page 2: Data Mining Demystified

Prediction is very hard, especially when it's about the future.

- Yogi Berra

Page 3: Data Mining Demystified

Data Mining Stories

“My bank called and said that they saw that I bought two surfboards at Laguna Beach, California.” - credit card fraud detection

The NSA is using data mining to analyze telephone call data to track al’Qaeda activities

Victoria’s Secret uses data mining to control product distribution based on typical customer buying patterns at individual stores

Page 4: Data Mining Demystified

Preview

Why data mining?

Example data sets

Data mining methods

Example application of data mining

Social issues of data mining

Page 5: Data Mining Demystified

Source: Han

Why Data Mining?

Database systems have been around since the 1970s

Organizations have a vast digital history of the day-to-day pieces of their processes

Simple queries no longer provide satisfying results: they take too long to execute, and they cannot help us find new opportunities

Page 6: Data Mining Demystified

Source: Han

Why Data Mining?

Data doubles about every year while useful information seems to be decreasing

Vast data stores overload traditional decision making processes

We are data rich, but information poor

Page 7: Data Mining Demystified

Data Mining: a definition

Simply stated, data mining refers to the extraction of knowledge from large amounts of data.

Page 8: Data Mining Demystified

Source: Dunham

Data Mining Models: A Taxonomy

Data Mining
  Predictive: Classification, Regression, Time Series Analysis, Prediction
  Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery

Page 9: Data Mining Demystified

Example Datasets

Iris

Wine

Diabetes

Page 10: Data Mining Demystified

Source: Fisher

Iris Dataset

Created by R.A. Fisher (1936)

150 instances

Three cultivars (Setosa, Virginica, Versicolor), 50 instances each

4 measurements (petal width, petal length, sepal width, sepal length)

One cultivar (Setosa) is easily separable; the others are not – noisy data

Page 11: Data Mining Demystified

Iris Dataset Analysis

[Figure 4: petal width (cm) by record number for Iris-Setosa, Iris-Versicolor, and Iris-Virginica; Figure 2: sepal width (cm) by record number for the same three cultivars]

Page 12: Data Mining Demystified

Source: UCI Machine Learning Repository

Wine Dataset

This data is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different varieties.

153 instances with 13 constituents found in each of the three types of wines.

Page 13: Data Mining Demystified

Wine Dataset Analysis

[Figure: the flavonoid and ash constituent values plotted by instance for Class 1, Class 2, and Class 3]

Page 14: Data Mining Demystified

Source: UCI Machine Learning Repository

Diabetes Dataset

Data is based on a population of women who were at least 21 years old, of Pima Indian heritage, and living near Phoenix in 1990

768 instances

9 attributes (Pregnancies, PG Concentration, Diastolic BP, Tri Fold Thick, Serum Ins, BMI, DP Function, Age, Diabetes)

Dataset has many missing values; only 532 instances are complete

Page 15: Data Mining Demystified

Diabetes Dataset Analysis

[Figure: PG concentration and diastolic BP values plotted by instance for healthy and sick individuals]

Page 16: Data Mining Demystified

Classification

Classification builds a model using a training dataset with known classes of data

That model is used to classify new, unknown data into those classes

Page 17: Data Mining Demystified

Classification Techniques

K-Nearest Neighbors

Decision Tree Classification (ID3, C4.5)

Page 18: Data Mining Demystified

K-Nearest Neighbors Example

[Figure: scatter of class A and class B points; an unknown point X is assigned the majority class among its k nearest neighbors]

• Easy to explain

• Simple to implement

• Sensitive to the selection of the classification population

• Not always conclusive for complex data
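The classification rule sketched above fits in a few lines of Python. This is a hypothetical toy example (the points and labels are made up, and it is not the code behind the results on the next slide): an unknown point X takes the majority class among its k nearest labeled neighbors.

```python
from collections import Counter
import math

def knn_classify(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    points. `training` is a list of ((x, y), label) pairs; distances
    are Euclidean."""
    by_distance = sorted(training, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data: class A clustered near the origin, class B near (5, 5)
training = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
            ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]

print(knn_classify(training, (0.5, 0.5)))  # near the A cluster -> A
print(knn_classify(training, (5.5, 5.5)))  # near the B cluster -> B
```

Note how the sensitivity mentioned above shows up here: the answer depends entirely on which training points happen to sit near the query.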

Page 19: Data Mining Demystified

Source: Indelicato

K-Nearest Neighbors Example

MISCLASSIFICATION PERCENTAGES

Iris Dataset

             All Attributes    Petal Length and Petal Width
Setosa       0/150 = 0%        0/150 = 0%
Versicolor   0/150 = 0%        0/150 = 0%
Virginica    9/150 = 6%        7/150 = 4.67%
Total        6%                4.67%

Wine Dataset

             All Attributes    Phenols, Flavanoids, OD280/OD315
Class 1      0/153 = 0%        2/153 = 1.31%
Class 2      9/153 = 5.88%     30/153 = 19.61%
Class 3      0/153 = 0%        0/153 = 0%
Total        5.88%             20.92%

Page 20: Data Mining Demystified

Decision Tree Example (C4.5)

C4.5 is a decision tree generating algorithm based on the ID3 algorithm. It contains several improvements, particularly ones needed for a practical software implementation.

Choice of the best splitting attribute is based on an entropy calculation.

These improvements include:
  Choosing an appropriate attribute selection measure
  Handling training data with missing attribute values
  Handling attributes with differing costs
  Handling continuous attributes
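The entropy calculation behind the splitting choice can be sketched as ID3-style information gain (C4.5 itself refines this into a gain ratio). The weather-style rows below are invented purely for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attr, labels):
    """Entropy reduction from splitting `rows` on attribute `attr`."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s)
                    for s in subsets.values())
    return entropy(labels) - remainder

# Hypothetical training data: which attribute best predicts the class?
rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain",  "windy": "no"},
        {"outlook": "rain",  "windy": "yes"}]
labels = ["yes", "yes", "no", "no"]

print(information_gain(rows, "outlook", labels))  # 1.0: splits perfectly
print(information_gain(rows, "windy", labels))    # 0.0: tells us nothing
```

The tree builder picks the attribute with the highest gain at each node, so here "outlook" would become the root test.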

Page 21: Data Mining Demystified

Source: Seidler

Decision Tree Example (C4.5)

Iris dataset: Accuracy 97.67%
Wine dataset: Accuracy 86.7%

Page 22: Data Mining Demystified

Decision Tree Example (C4.5)

C4.5 produces a complex tree (195 nodes)

The simplified (pruned) tree reduces the classification accuracy

Diabetes dataset

            Before Pruning    After Pruning
Size        195               69
Errors      40 (5.2%)         102 (13.3%)
Accuracy    94.8%             86.7%

Page 23: Data Mining Demystified

Association Rules

Association rules are used to show the relationships between data items.

Purchasing one product when another product is purchased is an example of an association rule.

They do not represent any causality or correlation.

Page 24: Data Mining Demystified

Association Rule Techniques

Market Basket Analysis

Terminology

Transaction database

Association rule – implication {A, B} ⇒ {C}

Support – % of transactions in which {A, B, C} occurs

Confidence – ratio of the number of transactions that contain {A, B, C} to the number of transactions that contain {A, B}
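Both measures are direct counts over the transaction database. A minimal sketch with a made-up grocery basket database (the items and the rule are hypothetical):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Toy transaction database: each transaction is a set of purchased items
transactions = [{"bread", "milk"},
                {"bread", "milk", "eggs"},
                {"bread", "eggs"},
                {"milk", "eggs"}]

# Rule {bread} => {milk}
print(support(transactions, {"bread", "milk"}))       # 0.5 of all baskets
print(confidence(transactions, {"bread"}, {"milk"}))  # 2/3 of bread buyers
```

High confidence alone does not make a rule interesting, which is why the support threshold is applied first in market basket analysis.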

Page 25: Data Mining Demystified

Source: UCI Machine Learning Repository

Association Rule Example: 1984 United States Congressional Voting Records Database

Attribute Information:

1. Class Name: 2 (democrat, republican)
2. handicapped-infants: 2 (y,n)
3. water-project-cost-sharing: 2 (y,n)
4. adoption-of-the-budget-resolution: 2 (y,n)
5. physician-fee-freeze: 2 (y,n)
6. El-Salvador-aid: 2 (y,n)
7. religious-groups-in-schools: 2 (y,n)
8. anti-satellite-test-ban: 2 (y,n)
9. aid-to-Nicaraguan-contras: 2 (y,n)
10. MX-missile: 2 (y,n)
11. immigration: 2 (y,n)
12. synfuels-corporation-cutback: 2 (y,n)
13. education-spending: 2 (y,n)
14. superfund-right-to-sue: 2 (y,n)
15. crime: 2 (y,n)
16. duty-free-exports: 2 (y,n)
17. export-administration-act-south-africa: 2 (y,n)

Rules:

{budget resolution = no, MX-missile = no, aid to El Salvador = yes} ⇒ {Republican}, confidence 91.0%

{budget resolution = yes, MX-missile = yes, aid to El Salvador = no} ⇒ {Democrat}, confidence 97.5%

{crime = yes, right-to-sue = yes, physician fee freeze = yes} ⇒ {Republican}, confidence 93.5%

{crime = no, right-to-sue = no, physician fee freeze = no} ⇒ {Democrat}, confidence 100.0%

Page 26: Data Mining Demystified

Clustering

Clustering is similar to classification in that data are grouped.

Unlike classification, the groups are not predefined; they are discovered.

Grouping is accomplished by finding similarities between data according to characteristics found in the actual data.

Page 27: Data Mining Demystified

Clustering Techniques

K-Means Clustering

Neural Network Clustering (SOM)

Page 28: Data Mining Demystified

K-Means Example

The K-Means algorithm is a method to cluster objects, based on their attributes, into k partitions.

It assumes that the k clusters exhibit normal distributions.

The objective it tries to achieve is to minimize the variance within the clusters.
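The algorithm alternates two steps: assign each object to its nearest mean, then move each mean to the average of its assigned objects. A minimal sketch on one-dimensional data (the values below are invented; they loosely mimic a single attribute such as petal width):

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on 1-D data: assign each value to the nearest
    mean, then move each mean to the average of its assigned values."""
    # Seed the means with evenly spaced sorted values
    means = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - means[i]))
            clusters[nearest].append(v)
        means = [sum(c) / len(c) if c else means[i]
                 for i, c in enumerate(clusters)]
    return means, clusters

# Toy 1-D data with three obvious groups
values = [0.2, 0.2, 0.3, 1.3, 1.4, 1.5, 2.1, 2.2, 2.3]
means, clusters = kmeans_1d(values, k=3)
print(sorted(round(m, 2) for m in means))  # -> [0.23, 1.4, 2.2]
```

Each pass can only lower the within-cluster variance, so the loop settles quickly; on real data the result still depends on the initial means.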

Page 29: Data Mining Demystified

K-Means Example

[Figure: a dataset partitioned into Cluster 1, Cluster 2, and Cluster 3, each centered on its own mean (Mean 1, Mean 2, Mean 3)]

Page 30: Data Mining Demystified

K-Means Example

Iris dataset, only the petal width attribute, Accuracy 95.33%

Cluster 1: 46 Versicolor, 3 Virginica (cluster mean 4.22857)
Cluster 2: 4 Versicolor, 47 Virginica (cluster mean 5.55686)
Cluster 3: 50 Setosa (cluster mean 1.46275)

Iris dataset, all attributes, Accuracy 66.0%

Cluster 1: 47 Versicolor, 49 Virginica (mean 6.30, 2.89, 4.96, 1.70)
Cluster 2: 21 Setosa, 1 Virginica (mean 4.59, 3.07, 1.44, 0.29)
Cluster 3: 29 Setosa, 3 Versicolor (mean 5.21, 3.53, 1.67, 0.35)

Iris dataset, all attributes, seven clusters, Accuracy 90.67%

Cluster 1: 23 Virginica
Cluster 2: 1 Virginica
Cluster 3: 26 Setosa
Cluster 4: 12 Virginica
Cluster 5: 24 Versicolor, 1 Virginica
Cluster 6: 26 Versicolor, 13 Virginica
Cluster 7: 24 Setosa

Page 31: Data Mining Demystified

Self-Organizing Map Example

The Self-Organizing Map was first described by the Finnish professor Teuvo Kohonen and is thus sometimes referred to as a Kohonen map.

SOM is especially good for visualizing high-dimensional data.

SOM maps input vectors onto a two-dimensional grid of nodes.

Nodes that are close together have similar attribute values and nodes that are far apart have different attribute values.
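The training idea can be sketched compactly. This is a simplified toy version, not Kohonen's full algorithm: a one-dimensional grid instead of 2-D, a Gaussian neighborhood that shrinks over time, a fixed learning rate, and made-up input data. Each input pulls its winning (closest) node, and that node's grid neighbors, toward itself.

```python
import math
import random

def train_som(data, grid_size=5, epochs=200, lr=0.5):
    """Train a 1-D self-organizing map: each node holds a weight vector;
    the winning node and its grid neighbors move toward each input."""
    random.seed(0)
    dim = len(data[0])
    nodes = [[random.random() for _ in range(dim)] for _ in range(grid_size)]
    for epoch in range(epochs):
        # Neighborhood radius shrinks as training progresses
        radius = grid_size / 2 * (1 - epoch / epochs) + 0.5
        for x in data:
            winner = min(range(grid_size),
                         key=lambda i: math.dist(nodes[i], x))
            for i in range(grid_size):
                influence = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
                nodes[i] = [w + lr * influence * (xj - w)
                            for w, xj in zip(nodes[i], x)]
    return nodes

# Toy data: two well-separated 2-D groups
data = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2),
        (0.9, 0.9), (0.8, 0.9), (0.9, 0.8)]
nodes = train_som(data)
# After training, opposite ends of the grid specialize on different groups
print(round(math.dist(nodes[0], nodes[-1]), 2))
```

This is exactly the property the following slides show: similar inputs win neighboring nodes, so the class labels form contiguous regions on the map.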

Page 32: Data Mining Demystified

Self-Organizing Map Example

[Figure: input vectors in X-Y-Z space mapped onto the two-dimensional SOM node grid]

Page 33: Data Mining Demystified

Self-Organizing Map Example

Virginica Virginica Virginica   Versicolor   Setosa Setosa Setosa Setosa

Virginica Virginica Virginica   Versicolor     Setosa   Setosa

Virginica Virginica Virginica Versicolor Versicolor   Setosa Setosa Setosa Setosa

Virginica   Virginica   Versicolor     Setosa Setosa Setosa

Virginica Virginica Virginica   Versicolor Versicolor   Setosa Setosa Setosa

Virginica     Versicolor Versicolor Versicolor Versicolor     Setosa

Virginica   Virginica Versicolor Versicolor   Versicolor Versicolor Versicolor  

Virginica Virginica Virginica Versicolor Versicolor Versicolor Versicolor Versicolor Versicolor Versicolor

Virginica Versicolor Versicolor Versicolor Versicolor Versicolor Versicolor Versicolor Versicolor  

Virginica   Virginica Versicolor   Versicolor Virginica Versicolor Versicolor  

Iris Data

Page 34: Data Mining Demystified

Self-Organizing Map Example

Class-2 Class-2 Class-2 Class-2 Class-3 Class-2   Class-2   Class-3

Class-2 Class-2 Class-2 Class-2     Class-3 Class-3 Class-2 Class-3

Class-2 Class-2   Class-3 Class-2 Class-2 Class-2   Class-3  

Class-2 Class-3 Class-3 Class-3 Class-3   Class-3   Class-1  

Class-3 Class-3 Class-2 Class-3     Class-3 Class-3 Class-2 Class-1

Class-3 Class-3 Class-2 Class-3 Class-3 Class-3        

    Class-3 Class-3     Class-1   Class-1  

        Class-1 Class-1 Class-1 Class-1   Class-1

Class-2 Class-1 Class-1 Class-3     Class-1   Class-1  

  Class-2     Class-2 Class-1 Class-1 Class-1   Class-1

Wine Data

Page 35: Data Mining Demystified

Self-Organizing Map Example

Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy

Sick Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy

Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy

Healthy Healthy Sick Healthy Sick Healthy Healthy Healthy Healthy Healthy

Healthy Healthy Healthy   Healthy Healthy Healthy Healthy Healthy Healthy

Healthy Healthy Sick Healthy Healthy Healthy Healthy Healthy Healthy Healthy

Sick Sick Healthy Sick Sick Healthy Sick Healthy Healthy Healthy

Sick   Healthy Healthy Sick Sick Healthy Healthy Healthy Healthy

Sick   Sick Healthy Sick Healthy Sick Sick Healthy Healthy

Sick Healthy Sick Sick Sick Sick Sick Sick Healthy Sick

Diabetes Data

Page 36: Data Mining Demystified

Source: McKee

NFL Quarterback Analysis

Data from 2005 for 42 NFL quarterbacks

Preprocessed data to normalize for a full 16 game regular season

Used SOM to cluster individuals based on performance and descriptive data

Page 37: Data Mining Demystified

Source: McKee

NFL Quarterback Analysis

[Figure: The SOM Map]

Page 38: Data Mining Demystified

Source: McKee

NFL Quarterback Analysis

[Figure: QB Passing Rating and Overall Clustering]

Page 39: Data Mining Demystified

Source: McKee

NFL Quarterback Analysis

[Figure: The SOM Map]

Page 40: Data Mining Demystified

Data Mining Stories - Revisited

Credit card fraud detection

NSA telephone network analysis

Supply chain management

Page 41: Data Mining Demystified

Social Issues of Data Mining

Impacts on personal privacy and confidentiality

Classification and clustering are similar to profiling

Association rules resemble logical implications

Data mining is an imperfect process subject to interpretation

Page 42: Data Mining Demystified

Conclusion

Why data mining?

Example data sets

Data mining methods

Example application of data mining

Social issues of data mining

Page 43: Data Mining Demystified

What on earth would a man do with himself if something did not stand in his way? - H.G. Wells

I don’t think necessity is the mother of invention – invention, in my opinion, arises directly from idleness, probably also from laziness, to save oneself trouble. - Agatha Christie, from “An Autobiography, Pt III, Growing Up”

Page 44: Data Mining Demystified

References

Dunham, Margaret, Data Mining Introductory and Advanced Topics, Pearson Education, Inc., 2003

Fisher, R.A., The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics 7, pp. 179-188, 1936

Han, Jiawei, Data Mining: Concepts and Techniques, Elsevier Inc., 2006

Indelicato, Nicolas, Analysis of the K-Nearest Neighbors Algorithm, MATH 4500: Foundations of Data Mining, 2004

McKee, Kevin, The Self Organized Map Applied to 2005 NFL Quarterbacks, MATH 4200: Data Mining Foundations, 2006

Newman, D.J., Hettich, S., Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science

Seidler, Toby, The C4.5 Project: An Overview of the Algorithm with Results of Experimentation, MATH 4500: Foundations of Data Mining, 2004