An Introduction to Data Mining


Page 1: 2009-11-24-AnIntroductiontoDataMining

Ling Chen

[email protected]

Slides courtesy of:
http://www.cse.iitb.ac.in/dbms/Data/Talks/datamining-intro-IEP.ppt

http://www-users.cs.umn.edu/~kumar/dmbook/figures/chap1.ppt

Page 2: 2009-11-24-AnIntroductiontoDataMining

- Lots of data is being collected and warehoused:
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit-card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong:
  - Provide better, customized services for an edge (e.g., in Customer Relationship Management)

Page 3: 2009-11-24-AnIntroductiontoDataMining

- Data is collected and stored at enormous speeds (GB/hour):
  - Remote sensors on a satellite
  - Telescopes scanning the skies
  - Microarrays generating gene expression data
  - Scientific simulations generating terabytes of data
- Traditional techniques are infeasible for such raw data
- Data mining may help scientists:
  - in classifying and segmenting data
  - in hypothesis formation

Page 4: 2009-11-24-AnIntroductiontoDataMining

- There is often information hidden in the data that is not readily evident.
- Human analysts may take weeks to discover useful information.

Page 5: 2009-11-24-AnIntroductiontoDataMining

- Many definitions:
  - Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  - Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Page 6: 2009-11-24-AnIntroductiontoDataMining

- What is not data mining?
  - Looking up a phone number in a phone directory
  - Querying a Web search engine for information about "Amazon"
- What is data mining?
  - Discovering that certain names are more prevalent in certain US locations (e.g., O'Brien, O'Rourke, O'Reilly, ... in the Boston area)
  - Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest vs. Amazon.com)

Page 7: 2009-11-24-AnIntroductiontoDataMining

- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to:
  - Enormity of the data
  - High dimensionality of the data
  - Heterogeneous, distributed nature of the data

[Figure: Venn diagram placing Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database Systems]

Page 8: 2009-11-24-AnIntroductiontoDataMining

- Prediction methods: use some variables to predict unknown or future values of other variables.
- Description methods: find human-interpretable patterns that describe the data.

Page 9: 2009-11-24-AnIntroductiontoDataMining

- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]

Page 10: 2009-11-24-AnIntroductiontoDataMining

Classification (Supervised learning)

Page 11: 2009-11-24-AnIntroductiontoDataMining

Given old data about customers and payments, predict a new applicant's loan eligibility.

[Figure: previous customers, described by age, salary, profession, location, and customer type, train a classifier such as a decision tree (with nodes like "Salary > 5K" and "Prof. = Exec"); the new applicant's data is then classified as good or bad.]
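As a hedged illustration, here is one plausible reading of the tree in the figure applied to a new applicant. The branch logic and the helper name classify_applicant are assumptions for this sketch; a real tree would be learned from the previous customers' data.

```python
# Minimal sketch: applying the slide's toy loan-eligibility tree.
# The rules (salary > 5K, profession == "exec") mirror the figure's nodes.

def classify_applicant(salary, profession):
    """Walk the hand-built tree and return 'good' or 'bad'."""
    if salary > 5000:
        return "good"               # high earners are eligible outright
    if profession == "exec":
        return "good"               # executives eligible below the cutoff
    return "bad"

print(classify_applicant(salary=8000, profession="teacher"))  # good
print(classify_applicant(salary=3000, profession="exec"))     # good
print(classify_applicant(salary=3000, profession="clerk"))    # bad
```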

Page 12: 2009-11-24-AnIntroductiontoDataMining

- Goal: predict class Ci = f(x1, x2, ..., xn)
- Methods:
  - Regression (linear or any other polynomial), e.g., a*x1 + b*x2 + c = Ci (a least-squares sketch follows this list)
  - Nearest neighbour
  - Decision tree classifier: divide the decision space into piecewise constant regions
  - Neural networks: partition by non-linear boundaries
  - Bayesian classifiers
  - SVM
  - ...
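A small sketch of the regression-as-classifier idea above: fit a*x1 + b*x2 + c to class labels encoded as -1/+1 by least squares, then threshold at zero. The data points here are illustrative, not from the slides.

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.5], [6.0, 7.0], [7.0, 6.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])          # two classes encoded as -1/+1

A = np.hstack([X, np.ones((len(X), 1))])      # columns: x1, x2, constant term
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares fit of (a, b, c)

def classify(x1, x2):
    """Sign of the fitted linear decision value a*x1 + b*x2 + c."""
    return 1 if coef @ [x1, x2, 1.0] > 0 else -1

print(classify(1.5, 1.8), classify(6.5, 6.8))  # expected: -1  1
```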

Page 13: 2009-11-24-AnIntroductiontoDataMining

Define proximity between instances, find the neighbors of a new instance, and assign the majority class (sketched in code below).

Pros:
+ Fast training

Cons:
- Slow during application
- No feature selection
- Notion of proximity vague
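A minimal nearest-neighbour sketch of the scheme above. The training points, labels, and k are illustrative assumptions.

```python
# k-nearest-neighbour with Euclidean proximity (pure Python).
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label); return the majority label
    among the k training points closest to the query."""
    neighbours = sorted(
        train,
        key=lambda item: math.dist(item[0], query)  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "good"), ((1.2, 0.8), "good"),
         ((4.0, 4.2), "bad"), ((4.1, 3.9), "bad")]
print(knn_classify(train, (1.1, 1.0)))  # -> 'good'
```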

Page 14: 2009-11-24-AnIntroductiontoDataMining

[Figure: example decision tree with internal nodes "Salary < 1M", "Prof = teacher", and "Age < 30", and leaves labeled Good/Bad]

Page 15: 2009-11-24-AnIntroductiontoDataMining

- Widely used learning method
- Easy to interpret: can be re-represented as if-then-else rules
- Does not require any prior knowledge of the data distribution; works well on noisy data

Pros:
+ Reasonable training time
+ Fast application
+ Easy to implement
+ Can handle a large number of features

Cons:
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data

Page 16: 2009-11-24-AnIntroductiontoDataMining

Set of nodes connected by directed, weighted edges.

Basic NN unit (a single neuron) with weights w1 = w2 = w3 = 0.3 and threshold t = 0.4:

  x1  x2  x3 |  y
   1   0   0 | -1
   1   0   1 |  1
   1   1   0 |  1
   1   1   1 |  1
   0   0   1 | -1
   0   1   0 | -1
   0   1   1 |  1
   0   0   0 | -1

$$\hat{y} = \begin{cases} 1, & \text{if } 0.3x_1 + 0.3x_2 + 0.3x_3 - 0.4 > 0 \\ -1, & \text{if } 0.3x_1 + 0.3x_2 + 0.3x_3 - 0.4 < 0 \end{cases}$$

In general, $y = \mathrm{sign}(w_1 x_1 + \dots + w_d x_d - t)$.
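A small sketch reproducing the unit above with the slide's own weights and threshold; running it prints exactly the truth table.

```python
# Single perceptron unit: weights 0.3 each, threshold t = 0.4.

def perceptron(x, w=(0.3, 0.3, 0.3), t=0.4):
    """Fire +1 if the weighted input sum exceeds the threshold, else -1."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > t else -1

for x in [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1),
          (0, 0, 1), (0, 1, 0), (0, 1, 1), (0, 0, 0)]:
    print(x, perceptron(x))
```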

Page 17: 2009-11-24-AnIntroductiontoDataMining

[Figure: a more typical NN, with input nodes x1, x2, x3, a layer of hidden nodes, and output nodes]

Page 18: 2009-11-24-AnIntroductiontoDataMining

Useful for learning complex data, e.g., handwriting, speech, and image recognition.

[Figure: decision boundaries produced by linear regression, a classification tree, and a neural network]

Page 19: 2009-11-24-AnIntroductiontoDataMining

Pros:
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features

Cons:
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing the number of nodes

Page 20: 2009-11-24-AnIntroductiontoDataMining

Clustering (Unsupervised Learning)

Page 21: 2009-11-24-AnIntroductiontoDataMining

- Unsupervised learning is used when old data with class labels is not available.
- Example: group/cluster existing customers based on the time series of their payment history, such that similar customers fall in the same cluster.
- Key requirement: a good measure of similarity between instances.

Page 22: 2009-11-24-AnIntroductiontoDataMining

- Numeric data: Euclidean, Manhattan distances
- Categorical data: 0/1 to indicate absence/presence (see the sketch after this list)
  - Hamming distance (number of positions that differ)
  - Jaccard coefficient: number of shared 1s / number of positions where at least one vector has a 1
  - Data-dependent measures: the similarity of A and B depends on their co-occurrence with C
- Combined numeric and categorical data: weighted normalized distance
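A small sketch of the measures above; the vectors are illustrative.

```python
import math

def euclidean(a, b):
    return math.dist(a, b)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions at which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Shared 1s over positions where at least one vector has a 1."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 0.0

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q), manhattan(p, q))   # 5.0 and 7.0
a, b = (1, 0, 1, 1), (1, 1, 0, 1)
print(hamming(a, b), jaccard(a, b))       # 2 and 2/4 = 0.5
```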

Page 23: 2009-11-24-AnIntroductiontoDataMining

- Hierarchical clustering
  - agglomerative vs. divisive
  - single link vs. complete link
- Partitional clustering
  - distance-based: K-means
  - model-based: GMM
  - density-based: DBSCAN

Page 24: 2009-11-24-AnIntroductiontoDataMining

- Given: a matrix of similarity between every pair of points
- Start with each point in a separate cluster and merge clusters based on some criterion (sketched in code below):
  - Single link: merge the two clusters for which the minimum distance between two points from the two different clusters is smallest
  - Complete link: merge the two clusters for which the maximum distance between two points from the two different clusters is smallest
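A naive single-link agglomerative sketch in pure Python; the points and the stopping condition (merge until two clusters remain) are illustrative assumptions.

```python
import math

points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0)]
clusters = [[i] for i in range(len(points))]   # start: one cluster per point

def single_link(c1, c2):
    """Single link: distance between the closest pair across two clusters."""
    return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

while len(clusters) > 2:                       # merge until 2 clusters remain
    a, b = min(
        ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
        key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
    )
    clusters[a] += clusters.pop(b)             # merge the closest pair

print(clusters)   # -> [[0, 1], [2, 3, 4]]: the far point joins its nearest group
```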

Page 25: 2009-11-24-AnIntroductiontoDataMining

- Criterion: minimize the sum of squared distances between each point and the centroid of its cluster.
- Algorithm (sketched below):
  - Randomly select K points as initial centroids
  - Repeat until stabilization:
    - Assign each point to the closest centroid
    - Generate new cluster centroids
  - Adjust clusters by merging/splitting
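A minimal K-means sketch; 1-D data for brevity, and the data set and K are illustrative. A single random start is used here, whereas real implementations handle empty clusters, restarts, and the merge/split adjustment.

```python
import random

def kmeans(xs, k, iters=100):
    centroids = random.sample(xs, k)            # 1. random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:                            # 2. assign to closest centroid
            i = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[i].append(x)
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)] # 3. recompute centroids
        if new == centroids:                    # stabilized
            break
        centroids = new
    return centroids, clusters

data = [1.0, 1.1, 0.9, 8.0, 8.2, 7.9, 15.0, 15.1]
print(kmeans(data, k=3))
```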

Page 26: 2009-11-24-AnIntroductiontoDataMining

- Strengths:
  - Easy to use
  - Efficient to compute
- Weaknesses:
  - Initialization problem
  - Cannot handle clusters of different densities
  - Restricted to data for which there is a notion of a center/centroid

Page 27: 2009-11-24-AnIntroductiontoDataMining

Each data point is viewed as an observation from a mixture of Gaussian distributions:

$$P(x \mid \theta) = \sum_{j=1}^{K} w_j \, P(x \mid \theta_j)$$

where each component is a (one-dimensional) Gaussian,

$$P(x \mid \theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \, e^{-\frac{(x - \mu_j)^2}{2\sigma_j^2}}$$

and the likelihood of the data set $X = \{x_1, \dots, x_m\}$ is

$$p(X \mid \theta) = \prod_{i=1}^{m} p(x_i \mid \theta)$$
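To make the notation concrete, a small sketch evaluating the one-dimensional mixture density above. The weights, means, and standard deviations are illustrative; fitting them (e.g., by EM) is not shown.

```python
import math

def gaussian(x, mu, sigma):
    """P(x | theta_j) for a 1-D Gaussian component."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def mixture_density(x, weights, mus, sigmas):
    """P(x | theta) = sum_j w_j * P(x | theta_j)."""
    return sum(w * gaussian(x, mu, s) for w, mu, s in zip(weights, mus, sigmas))

# Two components: 70% of the mass around 0, 30% around 5.
print(mixture_density(0.2, weights=[0.7, 0.3], mus=[0.0, 5.0], sigmas=[1.0, 1.0]))
```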

Page 28: 2009-11-24-AnIntroductiontoDataMining

- Strengths:
  - More general than K-means
  - Better representation of clusters
  - Satisfies explicit statistical assumptions
- Weaknesses:
  - Inefficient in estimating the parameters
  - How to choose the model
  - Problems with noise and outliers

Page 29: 2009-11-24-AnIntroductiontoDataMining

Given a radius Eps and a threshold MinPts:

- Core point: a point whose Eps-neighborhood contains more than MinPts points.
- Border point: not a core point, but within the neighborhood of a core point.
- Outlier: neither a core point nor a border point.

Page 30: 2009-11-24-AnIntroductiontoDataMining

1. Label all points as core, border, or outlier points.
2. Eliminate the outlier points.
3. Put an edge between all pairs of core points that are within Eps of each other.
4. Make each group of connected core points into a separate cluster.
5. Assign each border point to one of the clusters of its associated core points (ties may need to be resolved).

These steps are sketched in code below.
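A compact DBSCAN sketch following the steps above; the points, eps, and min_pts are illustrative, and ties for border points are resolved by whichever cluster reaches them first.

```python
import math

def dbscan(points, eps, min_pts):
    n = len(points)
    near = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = [i for i in range(n) if len(near[i]) >= min_pts]   # step 1
    labels = {}
    cluster = 0
    for c in core:                    # steps 3-4: grow connected core points
        if c in labels:
            continue
        cluster += 1
        stack = [c]
        while stack:
            p = stack.pop()
            if p in labels:
                continue
            labels[p] = cluster
            if p in core:             # only core points extend the cluster;
                stack.extend(near[p]) # border points just get a label (step 5)
    return labels                     # unlabeled points are outliers (step 2)

pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (10, 0)]
print(dbscan(pts, eps=1.5, min_pts=3))  # two clusters; point 6 is an outlier
```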

Page 31: 2009-11-24-AnIntroductiontoDataMining

- Strengths:
  - Relatively resistant to noise
  - Handles clusters of arbitrary shapes and sizes
- Weaknesses:
  - Problems with clusters of widely varying densities
  - Density is more difficult to define for high-dimensional data
  - Expensive to compute all pairwise proximities

Page 32: 2009-11-24-AnIntroductiontoDataMining

Association Rules

Page 33: 2009-11-24-AnIntroductiontoDataMining

- Input: a set of transactions (groups of items)

  Transactions:
    milk, cereal, bread
    tea, milk, bread
    milk, rice
    cereal

- Goal: find all rules on itemsets of the form a --> b such that:
  - the support of a and b exceeds a threshold s
  - the confidence (conditional probability) of b given a exceeds a threshold c
- Example: milk --> bread (verified in the code sketch below)
  - Support(milk, bread) = 2/4
  - Confidence(milk --> bread) = 2/3
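A small sketch computing support and confidence for the rule above, using the slide's four transactions.

```python
transactions = [
    {"milk", "cereal", "bread"},
    {"tea", "milk", "bread"},
    {"milk", "rice"},
    {"cereal"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(a and b) / support(a)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"milk", "bread"}))        # 2/4 = 0.5
print(confidence({"milk"}, {"bread"}))   # 2/3 ~ 0.667
```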

Page 34: 2009-11-24-AnIntroductiontoDataMining

- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena

[Cartoon: in 1995, "Milk and cereal sell together!" is a surprise; by 1998 the same finding draws only "Zzzz..."]

Page 35: 2009-11-24-AnIntroductiontoDataMining

- Frequent itemset mining / infrequent itemset mining
- Positive association rules / negative association rules
- Frequent patterns in high-dimensional data:
  - Frequent sub-tree mining
  - Frequent sub-graph mining

Page 36: 2009-11-24-AnIntroductiontoDataMining

Other Issues

Page 37: 2009-11-24-AnIntroductiontoDataMining

- Classification
  - Metric: classification accuracy
  - Strategies: holdout, random sampling, cross-validation, bootstrap (cross-validation is sketched below)
- Clustering
  - Cohesion, separation
- Association rule mining
  - Efficiency w.r.t. thresholds
  - Scalability w.r.t. thresholds
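A minimal k-fold cross-validation sketch for estimating classification accuracy. The trainer interface, the majority-class stand-in classifier, and the data are illustrative assumptions.

```python
from collections import Counter

def cross_validate(data, labels, train_fn, k=5):
    """Average accuracy over k folds: train on k-1 folds, test on the rest."""
    n = len(data)
    accuracies = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))            # every k-th example
        train = [(data[i], labels[i]) for i in range(n) if i not in test_idx]
        model = train_fn(train)                      # user-supplied trainer
        correct = sum(model(data[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

def majority_trainer(train):
    """Trivial 'classifier': always predict the training fold's majority class."""
    majority = Counter(lbl for _, lbl in train).most_common(1)[0][0]
    return lambda x: majority

data = list(range(10))
labels = ["a"] * 7 + ["b"] * 3
print(cross_validate(data, labels, majority_trainer))  # 0.7
```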

Page 38: 2009-11-24-AnIntroductiontoDataMining

- Weka: http://www.cs.waikato.ac.nz/ml/weka/
- CLUstering TOolkit (CLUTO): http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
- SAS, SPSS

Page 39: 2009-11-24-AnIntroductiontoDataMining

- Web data mining
- Biological data mining
- Financial data mining
- Social network data mining
- ...

Page 40: 2009-11-24-AnIntroductiontoDataMining

Questions?

Thanks!

Page 41: 2009-11-24-AnIntroductiontoDataMining

- Assume a probability model for the generation of the data.
- Apply Bayes' theorem to find the most likely class:

$$\text{predicted class: } c = \arg\max_{c_j} p(c_j \mid d) = \arg\max_{c_j} \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$$

- Naive Bayes: assume the attributes a_1, ..., a_n are conditionally independent given the class value:

$$c = \arg\max_{c_j} \frac{p(c_j)}{p(d)} \prod_{i=1}^{n} p(a_i \mid c_j)$$

- Easy to learn the probabilities by counting
- Useful in some domains, e.g., text (a counting-based sketch follows)
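A tiny naive Bayes sketch for categorical attributes, learning the probabilities by counting as the slide says. The weather-style data is illustrative, and no smoothing is applied, so unseen attribute values get probability zero.

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (attribute_tuple, class_label); returns a predictor."""
    class_counts = Counter(c for _, c in examples)
    attr_counts = defaultdict(Counter)   # (class, position) -> value counts
    for attrs, c in examples:
        for i, a in enumerate(attrs):
            attr_counts[(c, i)][a] += 1

    def predict(attrs):
        def score(c):
            p = class_counts[c] / len(examples)       # prior p(c_j)
            for i, a in enumerate(attrs):             # product of p(a_i | c_j)
                p *= attr_counts[(c, i)][a] / class_counts[c]
            return p
        return max(class_counts, key=score)

    return predict

examples = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
            (("rainy", "mild"), "yes"), (("rainy", "hot"), "yes")]
predict = train_nb(examples)
print(predict(("rainy", "mild")))   # -> 'yes'
```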

Page 42: 2009-11-24-AnIntroductiontoDataMining

` "Perhaps the biggest limitation of the support vector approach lies in choice of the kernel."Burgess (1998)

` " A second limitation is speed and size, both in training and testing."Burgess (1998)

` "Discete data presents another problem..."Burgess (1998)

` "...the optimal design for multiclass SVM classifiers is a further area for research."Burgess (1998)

` " Although SVMs have good generalization performance, they can be abysmally slow in test phase, aproblem addressed in (Burges, 1996; Osuna and Girosi, 1998)."Burgess (1998)

` "Besides the advantages of SVMs - from a practical point of view - they have some drawbacks. Animportant practical question that is not entirely solved, is the selection of the kernel function parameters -for Gaussian kernels the width parameter [sigma] - and the value of [epsilon] in the [epsilon]-insensitiveloss function...[more]"Horváth (2003) in Suykens et al.

` "However, from a practical point of view perhaps the most serious problem with SVMs is the highalgorithmic complexity and extensive memory requirements of the required quadratic programming inlarge-scale tasks."

Horváth (2003) in Suykens et al. p 392