An Introduction to Data Mining
Ling Chen
Slides courtesy of:
http://www.cse.iitb.ac.in/dbms/Data/Talks/datamining-intro-IEP.ppt
http://www-users.cs.umn.edu/~kumar/dmbook/figures/chap1.ppt
• Lots of data is being collected and warehoused
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
  - Provide better, customized services for an edge (e.g., in Customer Relationship Management)
• Data collected and stored at enormous speeds (GB/hour)
  - Remote sensors on a satellite
  - Telescopes scanning the skies
  - Microarrays generating gene expression data
  - Scientific simulations generating terabytes of data
• Traditional techniques are infeasible for raw data
• Data mining may help scientists
  - in classifying and segmenting data
  - in hypothesis formation
• There is often information hidden in the data that is not readily evident.
• Human analysts may take weeks to discover useful information.
• Many definitions:
  - Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  - Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
• What is not Data Mining?
  - Looking up a phone number in a phone directory
  - Querying a Web search engine for information about "Amazon"
• What is Data Mining?
  - Discovering that certain names are more prevalent in certain US locations (e.g., O'Brien, O'Rourke, O'Reilly... in the Boston area)
  - Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database Systems.]
• Prediction methods: use some variables to predict unknown or future values of other variables.
• Description methods: find human-interpretable patterns that describe the data.
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
Classification (Supervised learning)
Given old data about customers and their payments, predict a new applicant's loan eligibility.

[Figure: previous customers, described by attributes (Age, Salary, Profession, Location, Customer type), are used to train a classifier; the learned decision tree (with splits such as Salary > 5 K and Prof. = Exec) labels a new applicant's data as good/bad.]
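As a concrete illustration, here is a minimal sketch of such a classifier in Python, assuming scikit-learn; the tiny customer table and its attribute encoding are invented for illustration and are not the presenter's data.

```python
# A minimal sketch of the loan-eligibility classifier, assuming scikit-learn.
# The toy training data below is hypothetical.
from sklearn.tree import DecisionTreeClassifier

# Previous customers: [age, salary (K), is_exec (1/0)] -> good (1) / bad (0)
X_train = [[35, 60, 1], [22, 3, 0], [50, 120, 1], [28, 4, 0], [40, 80, 0]]
y_train = [1, 0, 1, 0, 1]

clf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# Classify a new applicant's data as good/bad
new_applicant = [[30, 55, 0]]
print("good" if clf.predict(new_applicant)[0] == 1 else "bad")
```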
• Goal: predict class Ci = f(x1, x2, ..., xn)
• Methods:
  - Regression (linear or any other polynomial), e.g., a*x1 + b*x2 + c = Ci
  - Nearest neighbour
  - Decision tree classifier: divide decision space into piecewise-constant regions
  - Neural networks: partition by non-linear boundaries
  - Bayesian classifiers
  - SVM
  - ...
Define proximity between instances, find the neighbours of a new instance, and assign the majority class.

Pros
+ Fast training

Cons
- Slow during application
- No feature selection
- Notion of proximity vague
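A minimal sketch of this scheme, assuming scikit-learn's KNeighborsClassifier; the toy points are made up. Note how "training" is just storing the data, which is why it is fast.

```python
# A minimal nearest-neighbour sketch, assuming scikit-learn; data is made up.
from sklearn.neighbors import KNeighborsClassifier

X = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]]  # training instances
y = ["a", "a", "b", "b"]                              # class labels

# Proximity = Euclidean distance; the 3 nearest neighbours vote by majority.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)   # "training" just stores X
print(knn.predict([[4.8, 5.0]]))                      # -> ['b']
```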
[Figure: example decision tree with internal nodes testing Salary < 1 M, Prof = teacher, and Age < 30, and leaves labelled Good/Bad.]
• Widely used learning method
• Easy to interpret: can be re-represented as if-then-else rules (see the sketch after this list)
• Does not require any prior knowledge of the data distribution; works well on noisy data

Pros
+ Reasonable training time
+ Fast application
+ Easy to implement
+ Can handle a large number of features

Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
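Since the slide notes that a tree can be re-represented as if-then-else rules, here is a small sketch, assuming scikit-learn's export_text; the two-feature toy data is hypothetical.

```python
# Sketch: re-representing a fitted decision tree as if-then rules
# via scikit-learn's export_text. The toy data is hypothetical.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[900, 25], [1200, 45], [200, 30], [1500, 28]]  # [salary (K), age]
y = ["good", "good", "bad", "bad"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["salary", "age"]))  # nested if-then rules
```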
Set of nodes connected by directed weighted edges
Basic NN unit: inputs x1, x2, x3 with weights w1 = w2 = w3 = 0.3, threshold t = 0.4, and output y.

X1  X2  X3 |  y
-----------+----
 1   0   0 | -1
 1   0   1 |  1
 1   1   0 |  1
 1   1   1 |  1
 0   0   1 | -1
 0   1   0 | -1
 0   1   1 |  1
 0   0   0 | -1

\hat{y} = \begin{cases} 1, & \text{if } 0.3 x_1 + 0.3 x_2 + 0.3 x_3 - 0.4 \ge 0 \\ -1, & \text{if } 0.3 x_1 + 0.3 x_2 + 0.3 x_3 - 0.4 < 0 \end{cases}

In general: y = \operatorname{sign}(w_1 x_1 + \cdots + w_d x_d - t)
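To make the unit concrete, here is a pure-Python sketch of the thresholded sum above; it reproduces the truth table (the output is 1 exactly when at least two inputs are 1).

```python
# A sketch of the basic unit above: y = sign(w.x - t), w = (0.3, 0.3, 0.3), t = 0.4.
def unit(x1, x2, x3, w=0.3, t=0.4):
    s = w * x1 + w * x2 + w * x3 - t
    return 1 if s >= 0 else -1   # >= 0 maps to +1, matching the table

# Reproduces the truth table: output is 1 iff at least two inputs are 1.
for x in [(1,0,0), (1,0,1), (1,1,0), (1,1,1), (0,0,1), (0,1,0), (0,1,1), (0,0,0)]:
    print(x, "->", unit(*x))
```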
[Figure: a more typical NN: inputs x1, x2, x3 feed a layer of hidden nodes, which feed the output nodes.]
Useful for learning complex data, e.g., handwriting, speech, and image recognition.

[Figure: decision boundaries produced by linear regression, a classification tree, and a neural network on the same data.]
Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features

Cons
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing the number of nodes
Clustering (Unsupervised Learning)
• Unsupervised learning is used when old data with class labels is not available.
• Group/cluster existing customers based on the time series of their payment history, such that similar customers fall in the same cluster.
• Key requirement: a good measure of similarity between instances.
• Numeric data: Euclidean, Manhattan distances
• Categorical data: 0/1 to indicate absence/presence
  - Hamming distance (number of mismatches)
  - Jaccard coefficient: number of matching 1s / number of positions where either vector has a 1
  - Data-dependent measures: similarity of A and B depends on co-occurrence with C
• Combined numeric and categorical data: weighted normalized distance (computed in the sketch below)
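The basic measures above are simple to compute directly; a pure-Python sketch with made-up vectors:

```python
# Minimal sketches of the similarity measures above (pure Python, toy vectors).
import math

a = [1, 0, 1, 1, 0]  # categorical 0/1 vectors
b = [1, 1, 1, 0, 0]

hamming = sum(x != y for x, y in zip(a, b))              # number of mismatches
both_one = sum(x == 1 and y == 1 for x, y in zip(a, b))
any_one  = sum(x == 1 or  y == 1 for x, y in zip(a, b))
jaccard  = both_one / any_one                            # similarity on 1s

p, q = [1.0, 2.0], [4.0, 6.0]                            # numeric vectors
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))
manhattan = sum(abs(x - y) for x, y in zip(p, q))

print(hamming, round(jaccard, 2), euclidean, manhattan)  # 2 0.5 5.0 7.0
```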
• Hierarchical clustering
  - Agglomerative vs. divisive
  - Single link vs. complete link
• Partitional clustering
  - Distance-based: K-means
  - Model-based: GMM
  - Density-based: DBSCAN
• Given: a matrix of similarities between every pair of points.
• Start with each point in a separate cluster and merge clusters based on some criterion (sketched below):
  - Single link: merge the two clusters for which the minimum distance between two points from the two different clusters is least.
  - Complete link: merge the two clusters for which the maximum distance between two points from the two different clusters is least.
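A minimal agglomerative sketch, assuming SciPy's linkage/fcluster; the five 2-D points are toy data.

```python
# A minimal agglomerative clustering sketch, assuming SciPy; toy data.
from scipy.cluster.hierarchy import linkage, fcluster

points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]]

# method="single" merges on minimum inter-cluster distance,
# method="complete" on maximum inter-cluster distance.
Z = linkage(points, method="single")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)                                    # e.g., [1 1 1 2 2]
```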
• Criterion: minimize the sum of squared distances between each point and the centroid of its cluster.
• Algorithm (a minimal sketch follows):
  1. Randomly select K points as the initial centroids.
  2. Repeat until stabilization:
     - Assign each point to the closest centroid.
     - Generate new cluster centroids.
  3. Adjust clusters by merging/splitting.
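A bare-bones sketch of these steps in NumPy; the data is synthetic, the merge/split adjustment is omitted, and it assumes no cluster goes empty.

```python
# A bare-bones K-means sketch in NumPy, following the steps above.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
K = 2

centroids = X[rng.choice(len(X), K, replace=False)]  # random initial centroids
for _ in range(100):                                 # repeat until stabilization
    # assign each point to the closest centroid
    labels = np.argmin(((X[:, None] - centroids) ** 2).sum(axis=2), axis=1)
    # generate new cluster centroids (assumes no cluster is empty)
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
print(centroids)
```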
• Strengths
  - Easy to use.
  - Efficient to calculate.
• Weaknesses
  - Initialization problem.
  - Cannot handle clusters of different densities.
  - Restricted to data for which there is a notion of a center/centroid.
Each data point is viewed as an observation from a mixture of Gaussian distributions:

P(x) = \sum_{j=1}^{K} w_j \, P(x \mid \theta_j), \quad \text{where } P(x \mid \theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left( -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right)

p(X \mid \Theta) = \prod_{i=1}^{m} p(x_i \mid \Theta)
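A minimal sketch, assuming scikit-learn's GaussianMixture (which estimates the weights w_j, means mu_j, and variances by EM); the 1-D data is synthetic.

```python
# A minimal Gaussian-mixture sketch, assuming scikit-learn; synthetic 1-D data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(4, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
print(gmm.weights_, gmm.means_.ravel())  # estimated w_j and mu_j
print(gmm.predict([[3.5]]))              # most likely component for a new point
```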
• Strengths
  - More general than K-means.
  - Better representation of clusters.
  - Satisfies the statistical assumptions.
• Weaknesses
  - Inefficient in estimating the parameters.
  - Hard to choose the models.
  - Problems with noise and outliers.
Given the radius Eps and the threshold MinPts:
• Core point: the number of points within its Eps-neighborhood exceeds the threshold MinPts.
• Border point: not a core point, but falls within the neighborhood of a core point.
• Outlier: neither a core point nor a border point.
1. Label all points as core, border, or outlier points.
2. Eliminate the outlier points.
3. Put an edge between all core points that are within Eps of each other.
4. Make each group of connected core points into a separate cluster.
5. Assign each border point to one of the clusters of its associated core points (ties may need to be resolved). A sketch using scikit-learn follows.
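A minimal sketch, assuming scikit-learn's DBSCAN, whose eps and min_samples parameters play the roles of Eps and MinPts; the data is synthetic.

```python
# A minimal DBSCAN sketch, assuming scikit-learn; synthetic 2-D data.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),  # dense cluster
               rng.normal(4, 0.3, (30, 2)),  # second dense cluster
               [[10.0, 10.0]]])              # an outlier

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print(db.labels_)  # cluster ids; outliers are labelled -1
```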
• Strengths
  - Relatively resistant to noise.
  - Handles clusters of arbitrary shapes and sizes.
• Weaknesses
  - Problems with clusters of widely varying densities.
  - Density is more difficult to define with high-dimensional data.
  - Expensive: requires calculating all pairwise proximities.
Association Rules
• Input: a set of transactions (groups of items).
• Goal: find all rules on itemsets of the form a --> b such that
  - support of {a, b} > threshold s
  - confidence (the conditional probability of b given a) > threshold c
• Example transactions:
  1. milk, cereal, bread
  2. tea, milk, bread
  3. milk, rice
  4. cereal
• Example rule: milk --> bread (recomputed in the sketch below)
  - Support(milk, bread) = 2/4
  - Confidence(milk --> bread) = 2/3
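The example's numbers can be recomputed directly from the four transactions above; a pure-Python sketch:

```python
# Computing the example's support and confidence from the four transactions.
transactions = [
    {"milk", "cereal", "bread"},
    {"tea", "milk", "bread"},
    {"milk", "rice"},
    {"cereal"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    return support(a | b) / support(a)

print(support({"milk", "bread"}))       # 0.5       (= 2/4)
print(confidence({"milk"}, {"bread"}))  # 0.666...  (= 2/3)
```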
• Analysts already know about prevalent rules.
• Interesting rules are those that deviate from prior expectation.
• Mining's payoff is in finding surprising phenomena.

[Cartoon: in 1995 the discovery "Milk and cereal sell together!" is a surprise; by 1998 the same finding gets a "Zzzz..." because it is already well known.]
• Frequent itemset mining / infrequent itemset mining
• Positive association rules / negative association rules
• Frequent patterns in high-dimensional data
  - Frequent sub-tree mining
  - Frequent sub-graph mining
Other Issues
• Classification
  - Metric: classification accuracy
  - Strategy: holdout, random sampling, cross-validation, bootstrap (illustrated below)
• Clustering
  - Cohesion, separation
• Association rule mining
  - Efficiency w.r.t. thresholds
  - Scalability w.r.t. thresholds
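As one concrete strategy, here is a cross-validation sketch, assuming scikit-learn and its bundled iris dataset.

```python
# A minimal sketch of one evaluation strategy above: k-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)  # 5-fold CV
print(scores.mean())  # mean classification accuracy across folds
```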
• Weka: http://www.cs.waikato.ac.nz/ml/weka/
• CLUstering Toolkit (CLUTO): http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
• SAS, SPSS
• Web data mining
• Biological data mining
• Financial data mining
• Social network data mining
• ...
Questions?
Thanks!
• Assume a probability model for the generation of the data.
• Apply Bayes' theorem to find the most likely class:

  \text{predicted class: } \hat{c} = \arg\max_{c_j} p(c_j \mid d) = \arg\max_{c_j} \frac{p(d \mid c_j)\, p(c_j)}{p(d)}

• Naïve Bayes: assume the attributes are conditionally independent given the class value. Since p(d) is the same for every class, it can be dropped:

  \hat{c} = \arg\max_{c_j} p(c_j) \prod_{i=1}^{n} p(a_i \mid c_j)

• Easy to learn the probabilities by counting (see the sketch below).
• Useful in some domains, e.g., text.
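A sketch of "learning the probabilities by counting": a tiny categorical naïve Bayes in pure Python (made-up weather data, no smoothing).

```python
# A tiny naive Bayes over categorical attributes; probabilities by counting.
from collections import Counter, defaultdict

data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rainy", "mild"), "yes"), (("rainy", "hot"), "yes"),
        (("sunny", "mild"), "yes")]

class_counts = Counter(c for _, c in data)
attr_counts = defaultdict(Counter)          # (attr index, class) -> value counts
for attrs, c in data:
    for i, a in enumerate(attrs):
        attr_counts[(i, c)][a] += 1

def predict(attrs):
    # argmax over classes of p(c) * prod_i p(a_i | c)
    def score(c):
        p = class_counts[c] / len(data)
        for i, a in enumerate(attrs):
            p *= attr_counts[(i, c)][a] / class_counts[c]
        return p
    return max(class_counts, key=score)

print(predict(("rainy", "mild")))           # -> 'yes'
```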
• "Perhaps the biggest limitation of the support vector approach lies in choice of the kernel." Burges (1998)
• "A second limitation is speed and size, both in training and testing." Burges (1998)
• "Discrete data presents another problem..." Burges (1998)
• "...the optimal design for multiclass SVM classifiers is a further area for research." Burges (1998)
• "Although SVMs have good generalization performance, they can be abysmally slow in test phase, a problem addressed in (Burges, 1996; Osuna and Girosi, 1998)." Burges (1998)
• "Besides the advantages of SVMs - from a practical point of view - they have some drawbacks. An important practical question that is not entirely solved, is the selection of the kernel function parameters - for Gaussian kernels the width parameter [sigma] - and the value of [epsilon] in the [epsilon]-insensitive loss function... [more]" Horváth (2003) in Suykens et al.
• "However, from a practical point of view perhaps the most serious problem with SVMs is the high algorithmic complexity and extensive memory requirements of the required quadratic programming in large-scale tasks." Horváth (2003) in Suykens et al., p. 392