An Introduction to Data Mining
Ling Chen
Slides courtesy of:
http://www.cse.iitb.ac.in/dbms/Data/Talks/datamining-intro-IEP.ppt
http://www-users.cs.umn.edu/~kumar/dmbook/figures/chap1.ppt
• Lots of data is being collected and warehoused
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
  - Provide better, customized services for an edge (e.g., in Customer Relationship Management)
• Data collected and stored at enormous speeds (GB/hour)
  - Remote sensors on a satellite
  - Telescopes scanning the skies
  - Microarrays generating gene expression data
  - Scientific simulations generating terabytes of data
• Traditional techniques are infeasible for raw data
• Data mining may help scientists
  - in classifying and segmenting data
  - in hypothesis formation
• There is often information hidden in the data that is not readily evident.
• Human analysts may take weeks to discover useful information.
• Many definitions:
  - Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  - Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
• What is not Data Mining?
  - Looking up a phone number in a phone directory
  - Querying a Web search engine for information about "Amazon"
• What is Data Mining?
  - Discovering that certain names are more prevalent in certain US locations (e.g., O'Brien, O'Rourke, O'Reilly... in the Boston area)
  - Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database Systems.]
• Prediction methods: use some variables to predict unknown or future values of other variables.
• Description methods: find human-interpretable patterns that describe the data.
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
Classification (Supervised learning)
Given old data about customers and their payments, predict a new applicant's loan eligibility.

[Figure: previous customers, described by attributes (Age, Salary, Profession, Location, Customer type), are used to train a classifier; the learned decision tree (with splits such as Salary > 5 K and Prof. = Exec) labels a new applicant's data as good/bad.]
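As a concrete illustration, here is a minimal sketch of such a classifier in Python, assuming scikit-learn; the tiny customer table and its attribute encoding are invented for illustration and are not the presenter's data.

```python
# A minimal sketch of the loan-eligibility classifier, assuming scikit-learn.
# The toy training data below is hypothetical.
from sklearn.tree import DecisionTreeClassifier

# Previous customers: [age, salary (K), is_exec (1/0)] -> good (1) / bad (0)
X_train = [[35, 60, 1], [22, 3, 0], [50, 120, 1], [28, 4, 0], [40, 80, 0]]
y_train = [1, 0, 1, 0, 1]

clf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# Classify a new applicant's data as good/bad
new_applicant = [[30, 55, 0]]
print("good" if clf.predict(new_applicant)[0] == 1 else "bad")
```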
• Goal: predict class Ci = f(x1, x2, ..., xn)
• Methods:
  - Regression (linear or any other polynomial), e.g., a*x1 + b*x2 + c = Ci
  - Nearest neighbour
  - Decision tree classifier: divide decision space into piecewise-constant regions
  - Neural networks: partition by non-linear boundaries
  - Bayesian classifiers
  - SVM
  - ...
Define proximity between instances, find the neighbours of a new instance, and assign the majority class.

Pros
+ Fast training

Cons
- Slow during application
- No feature selection
- Notion of proximity vague
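A minimal sketch of this scheme, assuming scikit-learn's KNeighborsClassifier; the toy points are made up. Note how "training" is just storing the data, which is why it is fast.

```python
# A minimal nearest-neighbour sketch, assuming scikit-learn; data is made up.
from sklearn.neighbors import KNeighborsClassifier

X = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]]  # training instances
y = ["a", "a", "b", "b"]                              # class labels

# Proximity = Euclidean distance; the 3 nearest neighbours vote by majority.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)   # "training" just stores X
print(knn.predict([[4.8, 5.0]]))                      # -> ['b']
```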
[Figure: example decision tree with internal nodes testing Salary < 1 M, Prof = teacher, and Age < 30, and leaves labelled Good/Bad.]
• Widely used learning method
• Easy to interpret: can be re-represented as if-then-else rules (see the sketch after this list)
• Does not require any prior knowledge of the data distribution; works well on noisy data

Pros
+ Reasonable training time
+ Fast application
+ Easy to implement
+ Can handle a large number of features

Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
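Since the slide notes that a tree can be re-represented as if-then-else rules, here is a small sketch, assuming scikit-learn's export_text; the two-feature toy data is hypothetical.

```python
# Sketch: re-representing a fitted decision tree as if-then rules
# via scikit-learn's export_text. The toy data is hypothetical.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[900, 25], [1200, 45], [200, 30], [1500, 28]]  # [salary (K), age]
y = ["good", "good", "bad", "bad"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["salary", "age"]))  # nested if-then rules
```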
Set of nodes connected by directed weighted edges
Basic NN unit: inputs x1, x2, x3 with weights w1 = w2 = w3 = 0.3, threshold t = 0.4, and output y.

X1  X2  X3 |  y
-----------+----
 1   0   0 | -1
 1   0   1 |  1
 1   1   0 |  1
 1   1   1 |  1
 0   0   1 | -1
 0   1   0 | -1
 0   1   1 |  1
 0   0   0 | -1

\hat{y} = \begin{cases} 1, & \text{if } 0.3 x_1 + 0.3 x_2 + 0.3 x_3 - 0.4 \ge 0 \\ -1, & \text{if } 0.3 x_1 + 0.3 x_2 + 0.3 x_3 - 0.4 < 0 \end{cases}

In general: y = \operatorname{sign}(w_1 x_1 + \cdots + w_d x_d - t)
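To make the unit concrete, here is a pure-Python sketch of the thresholded sum above; it reproduces the truth table (the output is 1 exactly when at least two inputs are 1).

```python
# A sketch of the basic unit above: y = sign(w.x - t), w = (0.3, 0.3, 0.3), t = 0.4.
def unit(x1, x2, x3, w=0.3, t=0.4):
    s = w * x1 + w * x2 + w * x3 - t
    return 1 if s >= 0 else -1   # >= 0 maps to +1, matching the table

# Reproduces the truth table: output is 1 iff at least two inputs are 1.
for x in [(1,0,0), (1,0,1), (1,1,0), (1,1,1), (0,0,1), (0,1,0), (0,1,1), (0,0,0)]:
    print(x, "->", unit(*x))
```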
[Figure: a more typical NN: inputs x1, x2, x3 feed a layer of hidden nodes, which feed the output nodes.]
Useful for learning complex data, e.g., handwriting, speech, and image recognition.

[Figure: decision boundaries produced by linear regression, a classification tree, and a neural network on the same data.]
Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle a large number of features

Cons
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing the number of nodes
Clustering (Unsupervised Learning)
• Unsupervised learning is used when old data with class labels is not available.
• Group/cluster existing customers based on the time series of their payment history, such that similar customers fall in the same cluster.
• Key requirement: a good measure of similarity between instances.
• Numeric data: Euclidean, Manhattan distances
• Categorical data: 0/1 to indicate absence/presence
  - Hamming distance (number of mismatches)
  - Jaccard coefficient: number of matching 1s / number of positions where either vector has a 1
  - Data-dependent measures: similarity of A and B depends on co-occurrence with C
• Combined numeric and categorical data: weighted normalized distance (computed in the sketch below)
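The basic measures above are simple to compute directly; a pure-Python sketch with made-up vectors:

```python
# Minimal sketches of the similarity measures above (pure Python, toy vectors).
import math

a = [1, 0, 1, 1, 0]  # categorical 0/1 vectors
b = [1, 1, 1, 0, 0]

hamming = sum(x != y for x, y in zip(a, b))              # number of mismatches
both_one = sum(x == 1 and y == 1 for x, y in zip(a, b))
any_one  = sum(x == 1 or  y == 1 for x, y in zip(a, b))
jaccard  = both_one / any_one                            # similarity on 1s

p, q = [1.0, 2.0], [4.0, 6.0]                            # numeric vectors
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))
manhattan = sum(abs(x - y) for x, y in zip(p, q))

print(hamming, round(jaccard, 2), euclidean, manhattan)  # 2 0.5 5.0 7.0
```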
• Hierarchical clustering
  - Agglomerative vs. divisive
  - Single link vs. complete link
• Partitional clustering
  - Distance-based: K-means
  - Model-based: GMM
  - Density-based: DBSCAN
• Given: a matrix of similarities between every pair of points.
• Start with each point in a separate cluster and merge clusters based on some criterion (sketched below):
  - Single link: merge the two clusters for which the minimum distance between two points from the two different clusters is least.
  - Complete link: merge the two clusters for which the maximum distance between two points from the two different clusters is least.
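A minimal agglomerative sketch, assuming SciPy's linkage/fcluster; the five 2-D points are toy data.

```python
# A minimal agglomerative clustering sketch, assuming SciPy; toy data.
from scipy.cluster.hierarchy import linkage, fcluster

points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]]

# method="single" merges on minimum inter-cluster distance,
# method="complete" on maximum inter-cluster distance.
Z = linkage(points, method="single")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)                                    # e.g., [1 1 1 2 2]
```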
• Criterion: minimize the sum of squared distances between each point and the centroid of its cluster.
• Algorithm (a minimal sketch follows):
  1. Randomly select K points as the initial centroids.
  2. Repeat until stabilization:
     - Assign each point to the closest centroid.
     - Generate new cluster centroids.
  3. Adjust clusters by merging/splitting.
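A bare-bones sketch of these steps in NumPy; the data is synthetic, the merge/split adjustment is omitted, and it assumes no cluster goes empty.

```python
# A bare-bones K-means sketch in NumPy, following the steps above.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
K = 2

centroids = X[rng.choice(len(X), K, replace=False)]  # random initial centroids
for _ in range(100):                                 # repeat until stabilization
    # assign each point to the closest centroid
    labels = np.argmin(((X[:, None] - centroids) ** 2).sum(axis=2), axis=1)
    # generate new cluster centroids (assumes no cluster is empty)
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
print(centroids)
```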
• Strengths
  - Easy to use.
  - Efficient to calculate.
• Weaknesses
  - Initialization problem.
  - Cannot handle clusters of different densities.
  - Restricted to data for which there is a notion of a center/centroid.
Each data point is viewed as an observation from a mixture of Gaussian distributions:

P(x) = \sum_{j=1}^{K} w_j \, P(x \mid \theta_j), \quad \text{where } P(x \mid \theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left( -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right)

p(X \mid \Theta) = \prod_{i=1}^{m} p(x_i \mid \Theta)
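A minimal sketch, assuming scikit-learn's GaussianMixture (which estimates the weights w_j, means mu_j, and variances by EM); the 1-D data is synthetic.

```python
# A minimal Gaussian-mixture sketch, assuming scikit-learn; synthetic 1-D data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(4, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
print(gmm.weights_, gmm.means_.ravel())  # estimated w_j and mu_j
print(gmm.predict([[3.5]]))              # most likely component for a new point
```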
• Strengths
  - More general than K-means.
  - Better representation of clusters.
  - Satisfies the statistical assumptions.
• Weaknesses
  - Inefficient in estimating the parameters.
  - Hard to choose the models.
  - Problems with noise and outliers.
Given the radius Eps and the threshold MinPts:
• Core point: the number of points within its Eps-neighborhood exceeds the threshold MinPts.
• Border point: not a core point, but falls within the neighborhood of a core point.
• Outlier: neither a core point nor a border point.
1. Label all points as core, border, or outlier points.
2. Eliminate the outlier points.
3. Put an edge between all core points that are within Eps of each other.
4. Make each group of connected core points into a separate cluster.
5. Assign each border point to one of the clusters of its associated core points (ties may need to be resolved). A sketch using scikit-learn follows.
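A minimal sketch, assuming scikit-learn's DBSCAN, whose eps and min_samples parameters play the roles of Eps and MinPts; the data is synthetic.

```python
# A minimal DBSCAN sketch, assuming scikit-learn; synthetic 2-D data.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),  # dense cluster
               rng.normal(4, 0.3, (30, 2)),  # second dense cluster
               [[10.0, 10.0]]])              # an outlier

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print(db.labels_)  # cluster ids; outliers are labelled -1
```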
• Strengths
  - Relatively resistant to noise.
  - Handles clusters of arbitrary shapes and sizes.
• Weaknesses
  - Problems with clusters of widely varying densities.
  - Density is more difficult to define with high-dimensional data.
  - Expensive: requires calculating all pairwise proximities.
Association Rules
• Input: a set of transactions (groups of items).
• Goal: find all rules on itemsets of the form a --> b such that
  - support of {a, b} > threshold s
  - confidence (the conditional probability of b given a) > threshold c
• Example transactions:
  1. milk, cereal, bread
  2. tea, milk, bread
  3. milk, rice
  4. cereal
• Example rule: milk --> bread (recomputed in the sketch below)
  - Support(milk, bread) = 2/4
  - Confidence(milk --> bread) = 2/3
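The example's numbers can be recomputed directly from the four transactions above; a pure-Python sketch:

```python
# Computing the example's support and confidence from the four transactions.
transactions = [
    {"milk", "cereal", "bread"},
    {"tea", "milk", "bread"},
    {"milk", "rice"},
    {"cereal"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    return support(a | b) / support(a)

print(support({"milk", "bread"}))       # 0.5       (= 2/4)
print(confidence({"milk"}, {"bread"}))  # 0.666...  (= 2/3)
```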
• Analysts already know about prevalent rules.
• Interesting rules are those that deviate from prior expectation.
• Mining's payoff is in finding surprising phenomena.

[Cartoon: in 1995 the discovery "Milk and cereal sell together!" is a surprise; by 1998 the same finding gets a "Zzzz..." because it is already well known.]
• Frequent itemset mining / infrequent itemset mining
• Positive association rules / negative association rules
• Frequent patterns in high-dimensional data
  - Frequent sub-tree mining
  - Frequent sub-graph mining
Other Issues
• Classification
  - Metric: classification accuracy
  - Strategy: holdout, random sampling, cross-validation, bootstrap (illustrated below)
• Clustering
  - Cohesion, separation
• Association rule mining
  - Efficiency w.r.t. thresholds
  - Scalability w.r.t. thresholds
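As one concrete strategy, here is a cross-validation sketch, assuming scikit-learn and its bundled iris dataset.

```python
# A minimal sketch of one evaluation strategy above: k-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)  # 5-fold CV
print(scores.mean())  # mean classification accuracy across folds
```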
• Weka: http://www.cs.waikato.ac.nz/ml/weka/
• CLUstering Toolkit (CLUTO): http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
• SAS, SPSS
• Web data mining
• Biological data mining
• Financial data mining
• Social network data mining
• ...
Questions?
Thanks!
• Assume a probability model for the generation of the data.
• Apply Bayes' theorem to find the most likely class:

  \text{predicted class: } \hat{c} = \arg\max_{c_j} p(c_j \mid d) = \arg\max_{c_j} \frac{p(d \mid c_j)\, p(c_j)}{p(d)}

• Naïve Bayes: assume the attributes are conditionally independent given the class value. Since p(d) is the same for every class, it can be dropped:

  \hat{c} = \arg\max_{c_j} p(c_j) \prod_{i=1}^{n} p(a_i \mid c_j)

• Easy to learn the probabilities by counting (see the sketch below).
• Useful in some domains, e.g., text.
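A sketch of "learning the probabilities by counting": a tiny categorical naïve Bayes in pure Python (made-up weather data, no smoothing).

```python
# A tiny naive Bayes over categorical attributes; probabilities by counting.
from collections import Counter, defaultdict

data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rainy", "mild"), "yes"), (("rainy", "hot"), "yes"),
        (("sunny", "mild"), "yes")]

class_counts = Counter(c for _, c in data)
attr_counts = defaultdict(Counter)          # (attr index, class) -> value counts
for attrs, c in data:
    for i, a in enumerate(attrs):
        attr_counts[(i, c)][a] += 1

def predict(attrs):
    # argmax over classes of p(c) * prod_i p(a_i | c)
    def score(c):
        p = class_counts[c] / len(data)
        for i, a in enumerate(attrs):
            p *= attr_counts[(i, c)][a] / class_counts[c]
        return p
    return max(class_counts, key=score)

print(predict(("rainy", "mild")))           # -> 'yes'
```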
• "Perhaps the biggest limitation of the support vector approach lies in choice of the kernel." Burges (1998)
• "A second limitation is speed and size, both in training and testing." Burges (1998)
• "Discrete data presents another problem..." Burges (1998)
• "...the optimal design for multiclass SVM classifiers is a further area for research." Burges (1998)
• "Although SVMs have good generalization performance, they can be abysmally slow in test phase, a problem addressed in (Burges, 1996; Osuna and Girosi, 1998)." Burges (1998)
• "Besides the advantages of SVMs - from a practical point of view - they have some drawbacks. An important practical question that is not entirely solved, is the selection of the kernel function parameters - for Gaussian kernels the width parameter [sigma] - and the value of [epsilon] in the [epsilon]-insensitive loss function... [more]" Horváth (2003) in Suykens et al.
• "However, from a practical point of view perhaps the most serious problem with SVMs is the high algorithmic complexity and extensive memory requirements of the required quadratic programming in large-scale tasks." Horváth (2003) in Suykens et al., p. 392