Last lecture summary: test data and cross-validation
TRANSCRIPT
Test set method
• Split the data set into training and test data sets.
• Common ratio: 70:30.
• Train the algorithm on the training set, assess its performance on the test set.
• Disadvantages:
– This is simple, however it wastes data.
– The test set estimate of performance has high variance.
adopted from Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
[figure: data set split into Train | Test]
• Training error cannot be used as an indicator of the model's performance, due to overfitting.
• Training data set: train a range of models, or a given model with a range of values for its parameters.
• Compare them on independent data, the validation set.
– If the model design is iterated many times, some overfitting to the validation data can occur, so it may be necessary to keep aside a third set.
• Test set: the set on which the performance of the selected model is finally evaluated.
LOOCV (leave-one-out cross-validation)
1. Choose one data point.
2. Remove it from the set.
3. Fit the remaining data points.
4. Note your error using the removed data point as test.
Repeat these steps for all points. When you are done, report the mean squared error (in case of regression).
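The loop above can be sketched in a few lines of numpy; the OLS model and the data here are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def loocv_mse(X, y, fit, predict):
    """Leave-one-out CV: remove each point once, fit the remaining
    points, and note the error on the removed point."""
    errors = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # 1.-2. choose and remove a point
        model = fit(X[mask], y[mask])          # 3. fit the remaining points
        pred = predict(model, X[i:i + 1])[0]   # 4. error on the removed point
        errors.append((pred - y[i]) ** 2)
    return np.mean(errors)                     # report the mean squared error

# Illustrative model: ordinary least squares via np.linalg.lstsq.
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda w, X: X @ w

rng = np.random.default_rng(0)
X = np.c_[np.ones(20), rng.uniform(0, 1, 20)]   # intercept + one feature
y = 2 + 3 * X[:, 1] + rng.normal(0, 0.1, 20)
print(loocv_mse(X, y, fit, predict))            # small positive number
```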
k-fold cross-validation
1. Randomly break the data into k partitions.
2. Remove one partition from the set.
3. Fit the remaining data points.
4. Note your error using the removed partition as the test data set.
Repeat these steps for all partitions. When you are done, report the mean squared error (in case of regression).
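A matching sketch for k-fold CV; the OLS model and data are again illustrative assumptions:

```python
import numpy as np

def kfold_mse(X, y, k, fit, predict, seed=0):
    """k-fold CV: randomly break the data into k partitions, remove each
    partition once, fit the rest, and note the error on the held-out part."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # 1. random partitions
    errors = []
    for fold in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False                    # 2. remove one partition
        model = fit(X[mask], y[mask])         # 3. fit the remaining points
        pred = predict(model, X[fold])        # 4. error on the removed partition
        errors.extend((pred - y[fold]) ** 2)
    return np.mean(errors)                    # report the mean squared error

fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda w, X: X @ w

rng = np.random.default_rng(0)
X = np.c_[np.ones(30), rng.uniform(0, 1, 30)]
y = 2 + 3 * X[:, 1] + rng.normal(0, 0.1, 30)
print(kfold_mse(X, y, k=10, fit=fit, predict=predict))
```

Note that LOOCV is the special case k = n.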
Selection and testing
• Complete procedure for algorithm selection and estimation of its quality:
1. Divide the data into train/test.
2. Choose the algorithm by cross-validation on the train set.
3. Use this algorithm to construct a classifier using the train set.
4. Estimate its quality on the test set.
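The whole four-step procedure, sketched on synthetic polynomial-regression data (the data, the degrees tried, and the fold count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = 1 - 2 * x + 3 * x**2 + rng.normal(0, 0.1, 60)   # true degree is 2

# 1. Divide the data into train/test (70:30).
n_train = int(0.7 * len(x))
x_tr, x_te, y_tr, y_te = x[:n_train], x[n_train:], y[:n_train], y[n_train:]

# 2. Choose the algorithm (here: the polynomial degree) by CV on the train set.
def cv_mse(x, y, degree, k=10):
    folds = np.array_split(np.arange(len(x)), k)
    errs = []
    for fold in folds:
        mask = np.ones(len(x), dtype=bool)
        mask[fold] = False
        coeffs = np.polyfit(x[mask], y[mask], degree)
        errs.extend((np.polyval(coeffs, x[fold]) - y[fold]) ** 2)
    return np.mean(errs)

best_degree = min(range(1, 7), key=lambda d: cv_mse(x_tr, y_tr, d))

# 3. Construct the final model on the whole train set.
coeffs = np.polyfit(x_tr, y_tr, best_degree)

# 4. Estimate its quality on the test set.
test_mse = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
print(best_degree, test_mse)
```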
[figure: data split into Train | Test, with the Train part further split into Train | Val]
Model selection via CV
[table: polynomial degrees 1–6 vs. MSEtrain and MSE10-fold, with the chosen degree marked]
adopted from Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
polynomial regression
• Similarity sij is a quantity that reflects the strength of the relationship between two objects or two features.
• Distance dij measures dissimilarity.
– Dissimilarity measures the discrepancy between the two objects based on several features.
– Distance satisfies the following conditions:
• distance is always positive or zero (dij ≥ 0)
• distance is zero if and only if it is measured from an object to itself (dii = 0)
• distance is symmetric (dij = dji)
– In addition, if the distance satisfies the triangle inequality dik ≤ dij + djk, then it is called a metric.
Distances for quantitative variables
• Minkowski distance (Lp norm):

$L_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$

• Distance matrix: a matrix with all pairwise distances.
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
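A sketch that reproduces the table above; the coordinates are an assumption (only the distances appear on the slide, and any points with these pairwise distances would do):

```python
import numpy as np

def minkowski(x, y, p):
    """L_p distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return (np.abs(x - y) ** p).sum() ** (1 / p)

# Hypothetical 2-D points consistent with the distance matrix above.
points = np.array([[0, 0], [2, 2], [3, 1], [5, 1]], dtype=float)

# Distance matrix: all pairwise distances (p = 2, i.e. Euclidean).
D = np.array([[minkowski(a, b, p=2) for b in points] for a in points])
print(np.round(D, 3))
```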
k-NN
• Supervised learning.
• The target function f may be:
– discrete-valued (classification)
– real-valued (regression)
• We assign the point to the class of the instance that is most similar to it.
• k-NN is a lazy learner.
• Lazy learning: generalization beyond the training data is delayed until a query is made to the system, as opposed to eager learning, where the system tries to generalize the training data before receiving queries.
Which k is best?
[figure: decision boundaries for k = 1 and k = 15; Hastie et al., The Elements of Statistical Learning]
• k too small: fitting noise and outliers, overfitting.
• k too large: smooths out distinctive behavior.
• Choose k by cross-validation.
Real-valued target function
• The algorithm calculates the mean value of the k nearest training examples.
• Example (k = 3): the three nearest neighbors have values 12, 14 and 10, so the predicted value = (12 + 14 + 10)/3 = 12.
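A minimal sketch of the averaging step; the training data are made up so that the three nearest neighbors of the query have values 12, 14 and 10:

```python
import numpy as np

def knn_regress(X_train, y_train, x, k):
    """Return the mean value of the k nearest training examples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every example
    nearest = np.argsort(dists)[:k]               # indices of the k nearest
    return y_train[nearest].mean()

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([12.0, 14.0, 10.0, 99.0])
print(knn_regress(X_train, y_train, x=np.array([1.0]), k=3))   # (12+14+10)/3 = 12.0
```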
Distance-weighted NN
• Give greater weight to closer neighbors, e.g. weight 1/d².
• Example (k = 4), neighbors of one class at distances 1 and 2, of the other class at distances 4 and 5:
– unweighted: 2 votes vs. 2 votes (a tie)
– weighted: 1/1² + 1/2² = 1.25 votes vs. 1/4² + 1/5² ≈ 0.102 votes
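The vote counts above, computed with 1/d² weights (a common choice, and the one the slide's numbers imply):

```python
import numpy as np

def weighted_votes(distances):
    """Each neighbor votes with weight 1/d^2, so closer neighbors count more."""
    return float((1.0 / np.asarray(distances, dtype=float) ** 2).sum())

# k = 4: one class has neighbors at distances 1 and 2, the other at 4 and 5.
print(weighted_votes([1, 2]))   # 1.25
print(weighted_votes([4, 5]))   # ≈ 0.102
```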
k-NN issues
• The curse of dimensionality is a problem.
• Significant computation may be required to process each new query.
• To find the nearest neighbors, one has to evaluate the full distance matrix.
• Efficient indexing of stored training examples helps, e.g. a k-d tree.
Clustering
• We have data, we don't know classes.
• Assign data objects into groups (called clusters) so that data objects from the same cluster are more similar to each other than objects from different clusters.
On clustering validation techniques, M. Halkidi, Y. Batistakis, M. Vazirgiannis
Stages of clustering process
Single linkage (nearest-neighbor method)
based on A Tutorial on Clustering Algorithms http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
[figure: map with Torino, Milano, Florence, Rome, Naples and Bari]

      BA   FL   MI   NA   RM   TO
BA     0  662  877  255  412  996
FL   662    0  295  468  268  400
MI   877  295    0  754  564  138
NA   255  468  754    0  219  869
RM   412  268  564  219    0  669
TO   996  400  138  869  669    0
The smallest entry is d(MI, TO) = 138, so Milano and Torino are merged first. Single linkage sets the distance from the new cluster MI/TO to each remaining city to the minimum of the two original distances:
d(MI/TO, BA) = min(877, 996) = 877
d(MI/TO, FL) = min(295, 400) = 295
d(MI/TO, NA) = min(754, 869) = 754
d(MI/TO, RM) = min(564, 669) = 564
After merging MI and TO:

       BA   FL  MI/TO  NA   RM
BA      0  662   877  255  412
FL    662    0   295  468  268
MI/TO 877  295     0  754  564
NA    255  468   754    0  219
RM    412  268   564  219    0
The smallest remaining entry is d(NA, RM) = 219, so Naples and Rome merge next:

       BA   FL  MI/TO  NA/RM
BA      0  662   877    255
FL    662    0   295    268
MI/TO 877  295     0    564
NA/RM 255  268   564      0
Bari then joins NA/RM at distance 255:

          BA/NA/RM   FL  MI/TO
BA/NA/RM      0     268   564
FL          268       0   295
MI/TO       564     295     0
Merge order:
Torino → Milano
Rome → Naples → Bari → Florence
Finally, join Torino–Milano and Rome–Naples–Bari–Florence.
Dendrogram
Torino → Milano (138)
Rome → Naples (219)
→ Bari (255)
→ Florence (268)
Join Torino–Milano and Rome–Naples–Bari–Florence (295)
[figure: dendrogram over BA, NA, RM, FL, MI, TO; the vertical axis is dissimilarity, with merges at heights 138, 219, 255, 268 and 295]
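The whole walkthrough can be checked with scipy's hierarchical clustering (assuming scipy is available); the merge heights should come out as 138, 219, 255, 268 and 295:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Order: BA, FL, MI, NA, RM, TO
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
], dtype=float)

# linkage expects the condensed (upper-triangle) form of the matrix.
Z = linkage(squareform(D), method="single")
print(Z[:, 2])   # merge heights: [138. 219. 255. 268. 295.]
```

`scipy.cluster.hierarchy.dendrogram(Z)` draws the same dendrogram as above.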
Complete linkage
Starting again from the full distance matrix, the closest pair is still MI–TO (138), so Milano and Torino are merged first. Complete linkage sets the distance from MI/TO to each remaining city to the maximum of the two original distances:
d(MI/TO, BA) = max(877, 996) = 996
d(MI/TO, FL) = max(295, 400) = 400
d(MI/TO, NA) = max(754, 869) = 869
d(MI/TO, RM) = max(564, 669) = 669
After merging MI and TO:

       BA   FL  MI/TO  NA   RM
BA      0  662   996  255  412
FL    662    0   400  468  268
MI/TO 996  400     0  869  669
NA    255  468   869    0  219
RM    412  268   669  219    0
NA and RM (219) merge next:

       BA   FL  MI/TO  NA/RM
BA      0  662   996    412
FL    662    0   400    468
MI/TO 996  400     0    869
NA/RM 412  468   869      0
Under complete linkage the merge order now differs from single linkage: the smallest entry is d(FL, MI/TO) = 400, so Florence joins Milano–Torino:

          BA  MI/TO/FL  NA/RM
BA         0     996     412
MI/TO/FL 996       0     869
NA/RM    412     869       0
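The same scipy check for complete linkage (again assuming scipy is available); note the different merge order and heights: 138, 219, 400, 412 and finally 996:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Order: BA, FL, MI, NA, RM, TO
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
], dtype=float)

Z = linkage(squareform(D), method="complete")
print(Z[:, 2])   # merge heights: [138. 219. 400. 412. 996.]
```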
Average linkage
Here the distance between two clusters is the average of the pairwise distances. After merging MI and TO:
d(MI/TO, BA) = (996 + 877)/2 = 936.5
Centroid method
The cluster is represented by its centroid, and the distance between clusters is the distance between their centroids; on this data, d(MI/TO, BA) = 895.
Ward's method
• In Ward's method metrics are not used and do not have to be chosen. Instead, sums of squares (i.e. squared Euclidean distances) between centroids of clusters are computed.
• Ward's method defines the distance between two clusters, A and B, as how much the sum of squares will increase when we merge them.
• At the beginning of clustering, the sum of squares starts out at zero (because every point is in its own cluster) and then grows as we merge clusters.
• Ward's method keeps this growth as small as possible.
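A numeric check of the statement above on two made-up clusters: the growth of the sum of squares caused by a merge equals the closed-form Ward cost |A||B|/(|A|+|B|) · ||μA − μB||²:

```python
import numpy as np

def ss(X):
    """Sum of squared distances of the points from their centroid."""
    return float(((X - X.mean(axis=0)) ** 2).sum())

# Two small illustrative clusters.
A = np.array([[0.0, 0.0], [2.0, 0.0]])
B = np.array([[5.0, 1.0], [7.0, 1.0], [6.0, 4.0]])

# How much the total sum of squares grows when A and B are merged:
growth = ss(np.vstack([A, B])) - ss(A) - ss(B)

# The same quantity in closed form:
nA, nB = len(A), len(B)
closed = nA * nB / (nA + nB) * float(((A.mean(0) - B.mean(0)) ** 2).sum())

print(growth, closed)   # both come out to 34.8 (up to rounding)
```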
Types of clustering
• hierarchical
– groups data with a sequence of nested partitions
• agglomerative
– bottom-up
– start with each data point as one cluster, and join the clusters until all points form one cluster
• divisive
– top-down
– initially all objects are in one cluster, which is then subdivided into smaller and smaller pieces
• partitional
– divides data points into some prespecified number of clusters without the hierarchical structure, i.e. divides the space
Hierarchical clustering
• Agglomerative methods are used more widely.
• Divisive methods need to consider 2^(N−1) − 1 possible subset divisions, which is very computationally intensive.
– computational difficulty of finding the optimum partitions
• Divisive clustering methods are better at finding large clusters than agglomerative methods.
Hierarchical clustering
• Disadvantages:
– High computational complexity, at least O(N²): all mutual distances need to be calculated.
– Inability to adjust once a split or merge has been performed (no undo).
k-means
• How to avoid computing all mutual distances?
• Calculate distances only from the representatives (centroids) of the clusters.
• Advantage: the number of centroids is much lower than the number of data points.
• Disadvantage: the number of centroids k must be given in advance.
k-means – kids' algorithm
• Once there was a land with N houses.
• One day, K kings arrived in this land.
• Each house was taken by the nearest king.
• But each community wanted its king to be at the center of the village, so the throne was moved there.
• Then the kings realized that some houses were now closer to them, so they took those houses, but they lost some others. This went on and on…
• Until one day they couldn't move any more, so they settled down and lived happily ever after in their villages.
k-means – adults' algorithm
• Decide on the number of clusters k.
• Randomly initialize k centroids.
• Repeat until convergence (centroids do not move):
– assign each point to the cluster represented by the centroid it is nearest to
– move each centroid to the mean of all points in its cluster
• Disadvantages:
– k must be determined in advance.
– Sensitive to initial conditions: the algorithm minimizes the following "energy" function, but may be trapped in local minima.

$E = \sum_{l=1}^{K} \sum_{x_i \in X_l} \left\| x_i - \mu_l \right\|^2$

(where X_l is the l-th cluster and μ_l its centroid)
– Applicable only when the mean is defined; then what about categorical data? E.g. replace the mean with the mode (k-modes).
– The arithmetic mean is not robust to outliers (use the median instead: k-medoids).
– Clusters are spherical, because the algorithm is based on distance.
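A plain numpy sketch of the adults' algorithm, including the energy function above; it is a minimal version that assumes no cluster ever becomes empty, and the test data are illustrative:

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Lloyd's k-means: assign points to the nearest centroid, move each
    centroid to the mean of its points, repeat until the centroids stop
    moving. Assumes no cluster becomes empty along the way."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random init
    for _ in range(max_iter):
        # distance of every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                          # nearest centroid
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):                        # centroids do not move
            break
        centroids = new
    energy = float(((X - centroids[labels]) ** 2).sum())       # E from the formula
    return labels, centroids, energy

# Two well-separated blobs; k-means should recover them.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centroids, energy = kmeans(X, k=2)
print(energy)   # small: each blob sits tightly around its centroid
```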