TRANSCRIPT

Statistics 202: Data Mining
Final review
Based in part on slides from textbook, slides of Susan Holmes
© Jonathan Taylor
December 5, 2012
Final review
Overview – Before Midterm
General goals of data mining.
Data types.
Preprocessing & dimension reduction.
Distances.
Multidimensional scaling.
Multidimensional arrays.
Decision trees.
Performance measures for classifiers.
Discriminant analysis.
Overview – After Midterm
More classifiers:
Rule-based Classifiers
Nearest-Neighbour Classifiers
Naive Bayes Classifiers
Neural Networks
Support Vector Machines
Random Forests
Boosting (AdaBoost / Gradient Boosting)
Clustering.
Outlier detection.
Rule based classifiers
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 3
Rule-based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Concepts
coverage
accuracy
mutual exclusivity
exhaustivity
Laplace accuracy
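These concepts can be sketched in code. The rule antecedent and toy records below are hypothetical, chosen to echo rule R1; Laplace accuracy uses the usual (correct + 1) / (covered + k) smoothing with k the number of classes.

```python
# Sketch of coverage, accuracy and Laplace accuracy for one rule.
def covers(rule, record):
    """A rule's antecedent is a dict of attribute -> required value."""
    return all(record.get(a) == v for a, v in rule.items())

def coverage(rule, records):
    """Fraction of records satisfying the rule's antecedent."""
    return sum(covers(rule, r) for r in records) / len(records)

def accuracy(rule, label, records):
    """Among covered records, the fraction carrying the rule's label."""
    covered = [r for r in records if covers(rule, r)]
    return sum(r["class"] == label for r in covered) / len(covered)

def laplace_accuracy(rule, label, records, k):
    """Smoothed accuracy: (correct + 1) / (covered + k), k = number of classes."""
    covered = [r for r in records if covers(rule, r)]
    correct = sum(r["class"] == label for r in covered)
    return (correct + 1) / (len(covered) + k)

records = [
    {"Give Birth": "no", "Can Fly": "yes", "class": "Birds"},
    {"Give Birth": "no", "Can Fly": "yes", "class": "Birds"},
    {"Give Birth": "no", "Can Fly": "yes", "class": "Reptiles"},
    {"Give Birth": "yes", "Can Fly": "no", "class": "Mammals"},
]
r1 = {"Give Birth": "no", "Can Fly": "yes"}  # R1's antecedent
print(coverage(r1, records))                  # 0.75: 3 of 4 records covered
print(accuracy(r1, "Birds", records))         # 2 of the 3 covered are Birds
print(laplace_accuracy(r1, "Birds", records, k=3))  # (2+1)/(3+3) = 0.5
```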
Nearest neighbour classifier
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
Training Records
Test Record
Compute Distance
Choose k of the “nearest” records
Choosing the value of k:
If k is too small, the classifier is sensitive to noise points.
If k is too large, the neighborhood may include points from other classes.
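The procedure on the slide (compute distances, choose the k nearest records, vote) can be sketched directly. The training data and the choice k = 3 are illustrative.

```python
# Minimal k-nearest-neighbour classifier: Euclidean distance, majority vote.
import math
from collections import Counter

def knn_predict(train, test_point, k):
    """train: list of (features, label) pairs; returns the majority label
    among the k records nearest to test_point."""
    nearest = sorted(train, key=lambda fl: math.dist(fl[0], test_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_predict(train, (0.5, 0.5), k=3))  # "A": all 3 nearest are class A
print(knn_predict(train, (5.5, 5.5), k=3))  # "B"
```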
Naive Bayes classifiers
Model:

P(Y = c | X_1 = x_1, ..., X_p = x_p) ∝ [ ∏_{l=1}^p P(X_l = x_l | Y = c) ] · P(Y = c)

For continuous features, typically a 1-dimensional QDA model is used (i.e. Gaussian within each class).

For discrete features: use the Laplace smoothed probabilities

P(X_j = l | Y = c) = ( #{i : X_ij = l, Y_i = c} + α ) / ( #{Y_i = c} + α · k ).
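A sketch of the discrete case with the Laplace-smoothed probabilities above (α = 1, k = number of levels of the feature). The tiny two-feature dataset is hypothetical.

```python
# Naive Bayes for discrete features with Laplace smoothing.
from collections import Counter

def fit(X, y, alpha=1.0):
    classes = sorted(set(y))
    levels = [sorted({row[j] for row in X}) for j in range(len(X[0]))]
    prior = {c: sum(yi == c for yi in y) / len(y) for c in classes}
    cond = {}  # cond[c][j][l] = P(X_j = l | Y = c), Laplace smoothed
    for c in classes:
        rows = [row for row, yi in zip(X, y) if yi == c]
        cond[c] = []
        for j, levs in enumerate(levels):
            counts = Counter(row[j] for row in rows)
            k = len(levs)
            cond[c].append({l: (counts[l] + alpha) / (len(rows) + alpha * k)
                            for l in levs})
    return prior, cond

def predict(prior, cond, x):
    def score(c):  # unnormalized posterior: prior times product of conditionals
        p = prior[c]
        for j, l in enumerate(x):
            p *= cond[c][j][l]
        return p
    return max(prior, key=score)

X = [("no", "yes"), ("no", "yes"), ("yes", "no"), ("yes", "no")]
y = ["bird", "bird", "mammal", "mammal"]
prior, cond = fit(X, y)
print(predict(prior, cond, ("no", "yes")))  # "bird"
```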
Neural networks: single layer
Artificial Neural Networks (ANN)
Neural networks: double layer
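A double-layer network (one hidden layer) can be sketched as a forward pass of two weighted sums with sigmoid activations. The XOR weights below are illustrative, not from the slides; XOR is the classic function a single-layer network cannot represent.

```python
# Forward pass of a one-hidden-layer network with sigmoid units.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, w2, b2):
    """x: inputs; W1, b1: hidden-layer weights/biases; w2, b2: output unit."""
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) + bi)
              for row, bi in zip(W1, b1)]
    return sigmoid(sum(wi * hi for wi, hi in zip(w2, hidden)) + b2)

# Hand-picked weights computing XOR: hidden unit 1 ~ OR, unit 2 ~ NAND.
W1 = [[20.0, 20.0], [-20.0, -20.0]]
b1 = [-10.0, 30.0]
w2, b2 = [20.0, 20.0], -30.0
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(forward(x, W1, b1, w2, b2)))  # 0, 1, 1, 0
```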
Support vector machine
Find the hyperplane that maximizes the margin ⇒ B1 is better than B2.
Support vector machines
Solves the problem

minimize_{β,α,ξ} ‖β‖²

subject to y_i (x_i^T β + α) ≥ 1 − ξ_i,  ξ_i ≥ 0,  ∑_{i=1}^n ξ_i ≤ C.
Non-separable problems
The ξ_i's can be removed from this problem, yielding

minimize_{β,α} ‖β‖₂² + γ ∑_{i=1}^n (1 − y_i f_{α,β}(x_i))_+

where (z)_+ = max(z, 0) is the positive-part function.

Or,

minimize_{β,α} ∑_{i=1}^n (1 − y_i f_{α,β}(x_i))_+ + λ ‖β‖₂²
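The penalized hinge-loss form can be evaluated directly. The data and the choices of β, α and λ below are illustrative.

```python
# Evaluating the penalized hinge-loss SVM objective at a given (beta, alpha).
def hinge(z):
    """Positive part (1 - z)_+ applied to the margin z = y * f(x)."""
    return max(1.0 - z, 0.0)

def svm_objective(beta, alpha, X, y, lam):
    """sum_i (1 - y_i (x_i . beta + alpha))_+ + lam * ||beta||_2^2"""
    f = [sum(b * xi for b, xi in zip(beta, x)) + alpha for x in X]
    penalty = lam * sum(b * b for b in beta)
    return sum(hinge(yi * fi) for yi, fi in zip(y, f)) + penalty

X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, -1, 1]
print(svm_objective((1.0, 0.0), 0.0, X, y, lam=0.1))
# margins 2, 2, 0.5 -> hinge losses 0, 0, 0.5; penalty 0.1; total 0.6
```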
Logistic vs. SVM
Figure: comparison of the logistic and SVM (hinge) loss curves.
Ensemble methods
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 74
General Idea
Bagging / Random Forests
In this method, one takes several bootstrap samples (samples with replacement) of the data.

For each bootstrap sample S_b, 1 ≤ b ≤ B, fit a model, retaining the classifier f^{*,b}.

After all models have been fit, use majority vote

f(x) = majority vote of (f^{*,b}(x))_{1 ≤ b ≤ B}.

We defined the OOB (out-of-bag) estimate of error.
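A sketch of the bagging loop: B bootstrap samples, one classifier per sample, majority vote. The base learner here is a 1-nearest-neighbour rule on 1-D data, an illustrative stand-in for the trees a random forest would use.

```python
# Bagging: fit one model per bootstrap sample, predict by majority vote.
import random
from collections import Counter

def one_nn(train):
    """Base learner: 1-nearest-neighbour classifier on 1-D features."""
    def classify(x):
        return min(train, key=lambda fl: abs(fl[0] - x))[1]
    return classify

def bagging_fit(data, B, rng):
    models = []
    for _ in range(B):
        boot = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(one_nn(boot))
    return models

def bagging_predict(models, x):
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(0.0, "A"), (0.2, "A"), (0.9, "B"), (1.1, "B")]
models = bagging_fit(data, B=25, rng=rng)
print(bagging_predict(models, 0.1))  # "A"
print(bagging_predict(models, 1.0))  # "B"
```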
Illustrating AdaBoost
Data points for training
Initial weights for each data point
Boosting as gradient descent
It turns out that boosting can be thought of as something like gradient descent.

In some sense, the boosting algorithm is a "steepest descent" algorithm to find

argmin_{f ∈ F} ∑_{i=1}^n L(y_i, f(x_i)).
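This descent view can be sketched with squared-error loss, where the negative gradient at stage m is simply the residual y_i − f(x_i): each stage fits a simple base learner (a one-split "stump") to the residuals and takes a small step in that direction. The 1-D data and step size are illustrative.

```python
# Boosting as stagewise descent on the squared-error loss.
def fit_stump(x, r):
    """Best single split on 1-D data minimizing squared error to residuals r."""
    best = None
    for s in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((ri - (lm if xi <= s else rm)) ** 2 for xi, ri in zip(x, r))
        if best is None or err < best[0]:
            best = (err, s, lm, rm)
    _, s, lm, rm = best
    return lambda xi: lm if xi <= s else rm

def boost(x, y, stages=50, lr=0.3):
    f = [0.0] * len(x)
    stumps = []
    for _ in range(stages):
        r = [yi - fi for yi, fi in zip(y, f)]  # negative gradient of L2 loss
        h = fit_stump(x, r)
        stumps.append(h)
        f = [fi + lr * h(xi) for fi, xi in zip(f, x)]
    return lambda xi: lr * sum(h(xi) for h in stumps)

x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 0.0, 1.0, 1.0]
model = boost(x, y)
print([round(model(xi), 2) for xi in x])  # close to y: [0.0, 0.0, 1.0, 1.0]
```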
Cluster analysis
What is Cluster Analysis?
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Inter-cluster distances are maximized; intra-cluster distances are minimized.
Clustering
Types of clustering
Partitional: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

Hierarchical: a set of nested clusters organized as a hierarchical tree. Each data object is in exactly one subset for any horizontal cut of the tree . . .
Cluster analysis

[Figure 14.4: Simulated data in the plane, clustered into three classes (represented by orange, blue and green) by the K-means clustering algorithm.]

. . . that at each level of the hierarchy, clusters within the same group are more similar to each other than those in different groups.

Cluster analysis is also used to form descriptive statistics to ascertain whether or not the data consists of a set of distinct subgroups, each group representing objects with substantially different properties. This latter goal requires an assessment of the degree of difference between the objects assigned to the respective clusters.

Central to all of the goals of cluster analysis is the notion of the degree of similarity (or dissimilarity) between the individual objects being clustered. A clustering method attempts to group the objects based on the definition of similarity supplied to it. This can only come from subject matter considerations. The situation is somewhat similar to the specification of a loss or cost function in prediction problems (supervised learning). There the cost associated with an inaccurate prediction depends on considerations outside the data.

Figure 14.4 shows some simulated data clustered into three groups via the popular K-means algorithm. In this case two of the clusters are not well separated, so that "segmentation" more accurately describes the part of this process than "clustering." K-means clustering starts with guesses for the three cluster centers. Then it alternates the following steps until convergence:

for each data point, the closest cluster center (in Euclidean distance) is identified;

A partitional example
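The two alternating K-means steps described above can be sketched directly; the points and initial centers below are illustrative.

```python
# K-means: alternate assignment to closest center and center re-estimation.
import math

def kmeans(points, centers, iters=20):
    for _ in range(iters):
        # Step 1: assign each point to its closest center (Euclidean distance).
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[j].append(p)
        # Step 2: move each center to the mean of its assigned points.
        centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else ctr
                   for cl, ctr in zip(clusters, centers)]
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)  # centers settle at the means of the two obvious groups
```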
K-means

[Figure 14.11 (left panel): observed (green) and expected (blue) values of log W_K for the simulated data of Figure 14.4. Both curves have been translated to equal zero at one cluster. (Right panel): Gap curve, equal to the difference between the observed and expected values of log W_K. The Gap estimate K* is the smallest K producing a gap within one standard deviation of the gap at K + 1; here K* = 2.]

This gives K* = 2, which looks reasonable from Figure 14.4.

14.3.12 Hierarchical Clustering

The results of applying K-means or K-medoids clustering algorithms depend on the choice for the number of clusters to be searched and a starting configuration assignment. In contrast, hierarchical clustering methods do not require such specifications. Instead, they require the user to specify a measure of dissimilarity between (disjoint) groups of observations, based on the pairwise dissimilarities among the observations in the two groups. As the name suggests, they produce hierarchical representations in which the clusters at each level of the hierarchy are created by merging clusters at the next lower level. At the lowest level, each cluster contains a single observation. At the highest level there is only one cluster containing all of the data.

Strategies for hierarchical clustering divide into two basic paradigms: agglomerative (bottom-up) and divisive (top-down). Agglomerative strategies start at the bottom and at each level recursively merge a selected pair of clusters into a single cluster. This produces a grouping at the next higher level with one less cluster. The pair chosen for merging consists of the two groups with the smallest intergroup dissimilarity. Divisive methods start at the top and at each level recursively split one of the existing clusters at . . .

Figure: Gap statistic
K-medoid

Algorithm

Same as K-means, except that the centroid is estimated not by the average, but by the observation having minimum pairwise distance with the other cluster members.

Advantage: the centroid is one of the observations, which is useful, e.g., when features are 0 or 1. Also, one only needs pairwise distances for K-medoids rather than the raw observations.
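The medoid update can be sketched from a distance matrix alone, which is the point of the "only pairwise distances" advantage. The 4-observation distance matrix below is hypothetical.

```python
# K-medoids center update: the medoid of a cluster is the member
# minimizing total distance to the other members.
def medoid(cluster_idx, dist):
    """cluster_idx: indices of cluster members; dist[i][j]: pairwise distances."""
    return min(cluster_idx,
               key=lambda i: sum(dist[i][j] for j in cluster_idx))

dist = [
    [0, 1, 2, 9],
    [1, 0, 1, 8],
    [2, 1, 0, 7],
    [9, 8, 7, 0],
]
print(medoid([0, 1, 2], dist))  # observation 1 is closest on average
```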
Silhouette plot
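The silhouette plot is built from per-observation silhouette values. Using the standard definition, s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to its own cluster and b(i) the smallest mean distance to another cluster; a sketch on illustrative 1-D data:

```python
# Per-observation silhouette value for a given clustering of 1-D data.
def mean_dist(i, members, x):
    return sum(abs(x[i] - x[j]) for j in members) / len(members)

def silhouette(i, labels, x):
    own = [j for j, l in enumerate(labels) if l == labels[i] and j != i]
    a = mean_dist(i, own, x)                       # cohesion
    b = min(mean_dist(i, [j for j, l in enumerate(labels) if l == c], x)
            for c in set(labels) if c != labels[i])  # separation
    return (b - a) / max(a, b)

x = [0.0, 0.1, 0.2, 5.0, 5.1]
labels = [0, 0, 0, 1, 1]
print(round(silhouette(0, labels, x), 2))  # well separated -> close to 1
```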
Cluster analysis

[Figure 14.12: Dendrogram from agglomerative hierarchical clustering with average linkage to the human tumor microarray data.]

. . . hierarchical structure produced by the algorithm. Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data.

The extent to which the hierarchical structure produced by a dendrogram actually represents the data itself can be judged by the cophenetic correlation coefficient. This is the correlation between the N(N − 1)/2 pairwise observation dissimilarities d_{ii′} input to the algorithm and their corresponding cophenetic dissimilarities C_{ii′} derived from the dendrogram. The cophenetic dissimilarity C_{ii′} between two observations (i, i′) is the intergroup dissimilarity at which observations i and i′ are first joined together in the same cluster.

The cophenetic dissimilarity is a very restrictive dissimilarity measure. First, the C_{ii′} over the observations must contain many ties, since only N − 1 of the total N(N − 1)/2 values can be distinct. Also these dissimilarities obey the ultrametric inequality

C_{ii′} ≤ max{C_{ik}, C_{i′k}}   (14.40)

A hierarchical example
Hierarchical clustering
Concepts
Top-down vs. bottom-up

Different linkages:

single linkage (minimum distance)
complete linkage (maximum distance)
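Bottom-up (agglomerative) clustering with a pluggable linkage can be sketched as repeatedly merging the pair of clusters with the smallest intergroup dissimilarity. The distance matrix is illustrative.

```python
# Agglomerative clustering with single or complete linkage.
def single(c1, c2, dist):
    return min(dist[i][j] for i in c1 for j in c2)   # minimum distance

def complete(c1, c2, dist):
    return max(dist[i][j] for i in c1 for j in c2)   # maximum distance

def agglomerate(n, dist, linkage, stop_at=1):
    clusters = [[i] for i in range(n)]
    while len(clusters) > stop_at:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        # Merge the pair with the smallest intergroup dissimilarity.
        a, b = min(pairs,
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]], dist))
        clusters = ([c for k, c in enumerate(clusters) if k not in (a, b)]
                    + [clusters[a] + clusters[b]])
    return clusters

dist = [
    [0, 1, 4, 5],
    [1, 0, 3, 4],
    [4, 3, 0, 1],
    [5, 4, 1, 0],
]
print(sorted(sorted(c) for c in agglomerate(4, dist, single, stop_at=2)))
# [[0, 1], [2, 3]]
```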
Mixture models
Similar to K-means but assignment to clusters is "soft".

Often applied with multivariate normal as the model within classes.

EM algorithm used to fit the model:

Estimate responsibilities.
Estimate within-class parameters, replacing labels (unobserved) with responsibilities.
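The two EM steps can be sketched for a two-component 1-D Gaussian mixture (the multivariate case follows the same pattern); the data and starting values are illustrative.

```python
# One EM pass for a two-component 1-D Gaussian mixture.
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(x, pi, mu, var):
    # E-step: responsibility of component k for each point.
    r = []
    for xi in x:
        w = [pi[k] * normal_pdf(xi, mu[k], var[k]) for k in range(2)]
        s = sum(w)
        r.append([wk / s for wk in w])
    # M-step: usual estimates with hard labels replaced by responsibilities.
    n = [sum(ri[k] for ri in r) for k in range(2)]
    pi = [nk / len(x) for nk in n]
    mu = [sum(ri[k] * xi for ri, xi in zip(r, x)) / n[k] for k in range(2)]
    var = [sum(ri[k] * (xi - mu[k]) ** 2 for ri, xi in zip(r, x)) / n[k]
           for k in range(2)]
    return pi, mu, var

x = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
pi, mu, var = [0.5, 0.5], [0.0, 5.0], [1.0, 1.0]
for _ in range(10):
    pi, mu, var = em_step(x, pi, mu, var)
print([round(m, 2) for m in mu])  # means near 0.1 and 5.1
```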
Model-based clustering
Summary
1 Choose a type of mixture model (e.g. multivariate Normal) and a maximum number of clusters, K.

2 Use a specialized hierarchical clustering technique: model-based hierarchical agglomeration.

3 Use clusters from the previous step to initialize EM for the mixture model.

4 Use BIC to compare different mixture models and models with different numbers of clusters.
Outliers
General steps
Build a profile of the "normal" behavior.

Use these summary statistics to detect anomalies, i.e. points whose characteristics are very far from the normal profile.

General types of schemes involve a statistical model of "normal", and "far" is measured in terms of likelihood.

Example: Grubbs' test chooses an outlier threshold to control the Type I error of any declared outliers if the data actually follows the model . . .
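The Grubbs statistic G = max_i |x_i − x̄| / s can be computed directly and compared with a critical value from the t distribution; the one-sided 5% critical value of about 2.18 for n = 10 quoted below is a tabulated constant, stated as an assumption. The data are illustrative.

```python
# Grubbs' test statistic: largest absolute deviation in standard-deviation units.
import math

def grubbs_statistic(x):
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))  # sample sd
    return max(abs(xi - mean) for xi in x) / s

x = [5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 5.0, 4.9, 5.1, 8.0]
G = grubbs_statistic(x)
print(round(G, 2), G > 2.18)  # 2.18 ~ tabulated one-sided 5% value for n = 10
```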