poggi analytics - distance - 1a
TRANSCRIPT
Buenos Aires, March 2016. Eduardo Poggi
www.umiacs.umd.edu/~mrastega/
Instance Based Learning
Distances, Introduction, k-nearest neighbor, Locally weighted regression, Radial Basis Functions, Case-Based Reasoning, Instance reduction
Distances
What if it applies to some rather than to all?
Distances
Example of pairwise similarities between product categories (upper triangle):

              Cars   Motorcycles  Electronics  Toys   Candy  Wheat  Chickens
Cars           1        0.8          0.5        0.2    0.1     0       0
Motorcycles              1           0.5        0.2    0.1     0       0
Electronics                           1         0.2    0.1     0       0
Toys                                             1     0.1     0       0
Candy                                                   1      0.5     0.5
Wheat                                                          1       0.7
Chickens                                                                1
Distances
The Levenshtein distance, also called edit distance or word distance, is the minimum number of operations required to transform one string of characters into another, where an operation is the insertion, deletion, or substitution of a single character.
https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python
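As a rough illustration of this definition (the wikibooks link above collects similar implementations), here is a short Python sketch of the dynamic-programming computation; the function name and the test strings are only examples:

def levenshtein(a, b):
    # Dynamic programming over a (len(a)+1) x (len(b)+1) table of edit costs.
    # prev[j] holds the distance between a[:i-1] and b[:j]; curr[j] between a[:i] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

print(levenshtein("kitten", "sitting"))  # 3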
Distances
www.sc.ehu.es/ccwgrrom/transparencias/pdf-vision-1-transparencias/capitulo-1.pdf
Distances
http://www.nidokidos.org/threads/29243-Animals-humans-face-similarity-funny-pics!!
Distances
http://lear.inrialpes.fr/people/nowak/similarity/
Distances
Product
  Food, Cleaning, Clothing
    Animal, Vegetable, Mineral
      Dairy, Meat
        Liquid milk, Fermented milk, Cheeses, Butter
          Whole yogurt, Skim yogurt
            Plain yogurt, Flavored yogurt
IBL?
The idea is simple: the class of an instance should be similar to the class assigned to similar examples. Store all the examples. When an instance arrives for classification, find the "most similar" examples and look at the classes assigned to them.
But: classification can be costly. Are all attributes equally relevant? How many examples count as similar? What if the similar examples have dissimilar classes? Do all similar examples carry the same weight? How similar do the similar ones have to be?
K-nearest neighbor
To define how similar two examples are we need a metric. We assume all examples are points in an n-dimensional space R^n and use the Euclidean distance. Let Xi and Xj be two examples. Their distance d(Xi, Xj) is defined as:
d(Xi, Xj) = ( Σk (xik – xjk)^2 )^(1/2)
where xik is the value of attribute k on example Xi.
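As a minimal sketch (pure Python, no external libraries assumed), the distance above can be written as:

import math

def euclidean_distance(xi, xj):
    # xi, xj: sequences of numeric attribute values of equal length.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

print(euclidean_distance((1.0, 2.0), (4.0, 6.0)))  # 5.0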
K-nearest neighbor for discrete classes
Figure: a new example is classified by looking at its K = 4 nearest neighbors.
Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric? Euclidean
2. How many nearby neighbors to look at? One
3. A weighting function (optional)? Unused
4. How to fit with the local points? Just predict the same output as the nearest neighbor.
Voronoi Diagram
Decision surface induced by a 1-nearest neighbor. The decision surface is a combination of convex polyhedra surrounding each training example.
The Zen of Voronoi Diagrams
Figures: decision surfaces for 0, 1, 3, 5, and 7 nearest neighbors.
k-Nearest Neighbour Classification Method
Key idea: keep all the training instances. Given a query example, take a vote amongst its k neighbours. Neighbours are determined by using a distance function.
k-Nearest Neighbour Classification Method
Figure: the same query classified with k = 1 and with k = 4.
Probability interpretation: estimate p(y|x) as
p(y|x) = |{ (xi, yi) : yi = y, xi ∈ N(x) }| / |N(x)|, where N(x) is the neighborhood around x.
Sample adapted from Rong Jin’s slides
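A small sketch of this estimate in Python, assuming examples are stored as (attribute vector, label) pairs and a distance function such as the euclidean_distance sketch above is supplied; all names are illustrative:

from collections import Counter

def knn_class_probabilities(query, examples, k, distance):
    # N(x): the k training examples closest to the query under the given distance.
    neighborhood = sorted(examples, key=lambda e: distance(query, e[0]))[:k]
    counts = Counter(label for _, label in neighborhood)
    # p(y|x) is estimated as the fraction of the k neighbours carrying label y.
    return {label: count / k for label, count in counts.items()}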
k-Nearest Neighbour Classification Method
Advantages: training is really fast; can learn complex target functions.
Disadvantages: slow at query time; efficient data structures are needed to speed up the query.
How to choose k?
Use validation with leave-one-out method
For k = 1, 2, …, K
Err(k) = 0;
1. Randomly select a training data point and hide its class label
2. Use the remaining data and the given k to predict the class label for the left-out data point
3. Err(k) = Err(k) + 1 if the predicted label is different from the true label
Repeat the procedure until all training examples are tested
Choose the k whose Err(k) is minimal
Choose the k whose Err(k) is minimal. In this example: Err(1) = 3, Err(2) = 2, Err(3) = 6, so k = 2 is chosen.
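A hedged sketch of this leave-one-out procedure in Python; the classifier is passed in as a parameter (for instance the majority-vote classifier sketched after the discrete-classes algorithm below), and all names are illustrative:

def choose_k(examples, max_k, predict):
    # examples: list of (attribute_vector, label) pairs.
    # predict(query, training_examples, k) -> predicted label.
    err = {k: 0 for k in range(1, max_k + 1)}
    for k in err:
        for i, (x, y) in enumerate(examples):
            held_out = examples[:i] + examples[i + 1:]   # leave example i out
            if predict(x, held_out, k) != y:
                err[k] += 1
    return min(err, key=err.get)                          # k with minimal Err(k)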
K-nearest neighbor for discrete classes
Algorithm (parameter k):
For each training example (X, C(X)), add the example to our training list.
When a new example Xq arrives, assign the class chosen by majority vote among the k nearest neighbors of Xq:
C(Xq) = argmax_v Σi δ(v, C(Xi))
where δ(a, b) = 1 if a = b and 0 otherwise.
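A minimal Python sketch of this majority-vote rule, assuming examples are (attribute vector, class) pairs and a distance such as the euclidean_distance sketch above; names are illustrative:

from collections import Counter

def knn_classify(query, examples, k, distance):
    neighbors = sorted(examples, key=lambda e: distance(query, e[0]))[:k]
    votes = Counter(label for _, label in neighbors)   # Σi δ(v, C(Xi)) for each class v
    return votes.most_common(1)[0][0]                  # argmax over v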
K-nearest neighbor for real-valued functions
Algorithm (parameter k):
For each training example (X, C(X)), add the example to our training list.
When a new example Xq arrives, predict the average value among the k nearest neighbors of Xq:
C(Xq) = Σi C(Xi) / k
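The real-valued variant is the same neighbor search followed by an average; a brief sketch under the same assumptions as above:

def knn_regress(query, examples, k, distance):
    # examples: list of (attribute_vector, target_value) pairs.
    neighbors = sorted(examples, key=lambda e: distance(query, e[0]))[:k]
    return sum(target for _, target in neighbors) / k   # Σ C(Xi) / k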
Distance Weighted Nearest Neighbor
It makes sense to weight the contribution of each example according to the distance to the new query example.
C(Xq) = argmax_v Σi wi δ(v, C(Xi))
For example, wi = 1 / d(Xq,Xi)
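A short sketch of distance-weighted voting with wi = 1 / d(Xq, Xi); the small epsilon is an illustrative guard against a zero distance when the query coincides with a stored example:

from collections import defaultdict

def weighted_knn_classify(query, examples, k, distance):
    neighbors = sorted(examples, key=lambda e: distance(query, e[0]))[:k]
    scores = defaultdict(float)
    for x, label in neighbors:
        scores[label] += 1.0 / (distance(query, x) + 1e-12)   # wi = 1 / d(Xq, Xi)
    return max(scores, key=scores.get)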
Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric? Euclidean
2. How many nearby neighbors to look at? k
3. A weighting function (optional)? wi = 1 / d(Xq, Xi)
4. How to fit with the local points? Predict the distance-weighted vote (or weighted average) of the k nearest neighbors' outputs.
Distance Weighted Nearest Neighbor for Real-Valued Functions
For real valued functions we average based on the weight function and normalize using the sum of all weights.
C(Xq) = Σi wi C(Xi) / Σi wi
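And the corresponding sketch for the real-valued case, normalizing by the sum of the weights (same illustrative assumptions as before):

def weighted_knn_regress(query, examples, k, distance):
    # C(Xq) = Σi wi C(Xi) / Σi wi with wi = 1 / d(Xq, Xi).
    neighbors = sorted(examples, key=lambda e: distance(query, e[0]))[:k]
    weights = [1.0 / (distance(query, x) + 1e-12) for x, _ in neighbors]
    targets = [target for _, target in neighbors]
    return sum(w * t for w, t in zip(weights, targets)) / sum(weights)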
Problems with k-nearest Neighbor
The distance between examples is based on all attributes. What if some attributes are irrelevant?
Consider the curse of dimensionality: the larger the number of irrelevant attributes, the greater their effect on the nearest-neighbor rule.
One solution is to use weights on the attributes. This is like stretching or contracting the dimensions on the input space.
Ideally we would like to eliminate all irrelevant attributes.
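One way to realize such attribute weights, as a sketch: scale each squared coordinate difference by a per-attribute weight, so a weight of 0 drops an irrelevant attribute from the metric entirely; how the weights are chosen or learned is a separate question:

import math

def weighted_euclidean_distance(xi, xj, attr_weights):
    # attr_weights stretches or contracts each dimension of the input space.
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(attr_weights, xi, xj)))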
Locally Weighted Regression
Let’s remember some terminology:
Regression: a problem similar to classification, but the value to predict is a real number.
Residual: The difference between the true target value f and our approximation f’: f(X) – f’(X)
Kernel Function: The distance function that provides a weight to each example. The kernel function K is a function of the distance between examples: K = f(d(Xi,Xq))
Locally Weighted Regression
The method is called locally weighted regression for the following reasons:
“Locally” because the predicted value for an example Xq is based only on the vicinity or neighborhood around Xq.
“Weighted” because the contribution of each neighbor of Xq will depend on the distance between the neighbor example and Xq.
“Regression” because the value to predict will be a real number.
Locally Weighted Regression
Consider the problem of approximating a target function using a linear combination of attribute values:
f'(X) = w0 + w1 x1 + w2 x2 + … + wn xn, where X = (x1, x2, …, xn)
We want to find the coefficients that minimize the error: E = ½ Σk [f(X) – f'(X)]^2
Locally Weighted Regression
If we do this in the vicinity of an example Xq and we wish to use a kernel function, we get a form of locally weighted regression:
E(Xq) = ½ Σk [f(X) – f'(X)]^2 K(d(Xq, X))
where the sum now goes over the neighbors of Xq.
Locally Weighted Regression
Using gradient descent search, the update rule is defined as:
Δwj = η Σk [f(X) – f'(X)] K(d(Xq, X)) xj
where η is the learning rate and xj is the j-th attribute of example X.
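A hedged Python sketch of this update rule; the Gaussian kernel, learning rate, and epoch count are illustrative choices, and `distance` stands for any metric such as the Euclidean distance defined earlier:

import math

def gaussian_kernel(d, width=1.0):
    # One common kernel choice: K(d) = exp(-d^2 / width^2).
    return math.exp(-(d ** 2) / (width ** 2))

def locally_weighted_fit(query, neighbors, distance, lr=0.01, epochs=200):
    # neighbors: list of (x, target) pairs near the query; fits the local linear
    # model f'(X) = w0 + w1*x1 + ... + wn*xn with the gradient rule
    # Δwj = η Σ [f(X) - f'(X)] K(d(Xq, X)) xj.
    n_attrs = len(neighbors[0][0])
    w = [0.0] * (n_attrs + 1)                    # w[0] plays the role of w0
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, target in neighbors:
            xs = (1.0,) + tuple(x)               # constant 1 input for the bias term
            pred = sum(wj * xj for wj, xj in zip(w, xs))
            k = gaussian_kernel(distance(query, x))
            for j, xj in enumerate(xs):
                grad[j] += (target - pred) * k * xj
        w = [wj + lr * g for wj, g in zip(w, grad)]
    return w                                     # coefficients of the local model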
Locally Weighted Regression
Here are some commonly used weighting functions (we use a Gaussian).
Nearest Neighbor
1. A distance metric? Scaled Euclidean
2. How many nearby neighbors to look at? All of them
3. A weighting function (optional)? w_k = exp(-D(x_k, x_query)^2 / Kw^2)
4. How to fit with the local points? First form a local linear model, then find the β that minimizes the locally weighted sum of squared residuals (see the sketch below).
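A sketch of that local linear fit, assuming NumPy is available; the Gaussian kernel width is an illustrative parameter, and the weighted least-squares problem is solved by rescaling rows and calling an ordinary least-squares routine:

import numpy as np

def local_linear_predict(x_query, X, y, kernel_width=1.0):
    # X: (m, n) matrix of training attribute vectors, y: (m,) targets.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    d = np.linalg.norm(X - x_query, axis=1)
    w = np.exp(-(d ** 2) / (kernel_width ** 2))            # Gaussian weights w_k
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])           # add intercept column
    sw = np.sqrt(w)[:, None]
    # Minimizing the weighted sum of squared residuals is equivalent to
    # ordinary least squares on rows scaled by sqrt(w_k).
    beta, *_ = np.linalg.lstsq(sw * Xb, np.sqrt(w) * y, rcond=None)
    return beta @ np.concatenate(([1.0], x_query))          # prediction at the query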
Locally Weighted Regression
Remarks:
The literature contains other functions that are nonlinear.
There are many variations to locally weighted regression that use different kernel functions.
Normally a linear model is sufficiently good to approximate the local neighborhood of an example.
Instance reduction
eduardo-poggi
http://ar.linkedin.com/in/eduardoapoggi
https://www.facebook.com/eduardo.poggi
@eduardoapoggi
Bibliography