Linear Separators
Bankruptcy example

R is the ratio of earnings to expenses. L is the number of late payments on credit cards over the past year. We would like to draw a linear separator here, and thereby obtain a classifier.
1-Nearest Neighbor Boundary

• The decision boundary will be the boundary between cells defined by points of different classes, as illustrated by the bold line shown here.
Decision Tree Boundary

Similarly, a decision tree also defines a decision boundary in the feature space.
Although both 1-NN and decision trees agree on all the training points, they disagree on the precise decision boundary and so will classify some query points differently.
This is the essential difference between different learning algorithms.
Linear Boundary

• Linear separators are characterized by a single linear decision boundary in the space.
– The bankruptcy data can be successfully separated in that manner.
• But there is no guarantee that a single linear separator will successfully classify any set of training data.
Linear Hypothesis Class

• Line equation (assume 2D first): w2x2 + w1x1 + b = 0
• Fact 1: All the points (x1, x2) lying on the line make the equation true.
• Fact 2: The line separates the plane into two half-planes.
• Fact 3: The points (x1, x2) in one half-plane give us an inequality with respect to 0, which has the same direction for each of the points in that half-plane.
• Fact 4: The points (x1, x2) in the other half-plane give us the reverse inequality with respect to 0.
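Facts 2–4 can be checked numerically. A minimal sketch in Python, using an arbitrary example line x2 = x1 (i.e. w2 = 1, w1 = -1, b = 0; the line and the test points are made up for illustration):

```python
# Evaluate w2*x2 + w1*x1 + b; the sign of the result tells us which
# half-plane the point (x1, x2) lies in (zero means it is on the line).
def side(w1, w2, b, x1, x2):
    return w2 * x2 + w1 * x1 + b

# Example line: x2 = x1, i.e. w2 = 1, w1 = -1, b = 0.
assert side(-1, 1, 0, 0, 1) > 0    # (0, 1) is above the line
assert side(-1, 1, 0, 1, 0) < 0    # (1, 0) is below the line
assert side(-1, 1, 0, 2, 2) == 0   # (2, 2) is on the line
```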
Fact 3 proof

w2x2 + w1x1 + b = 0

We can write it as:

x2 = -(w1/w2)x1 - b/w2

[Figure: the line in the (x1, x2) plane, with a point (p, q) below the line and the point (p, r) directly above it on the line.]

(p, r) is on the line, so:

r = -(w1/w2)p - b/w2

But q < r, so we get:

q < -(w1/w2)p - b/w2

i.e.

w2q + w1p + b < 0 if w2 > 0
w2q + w1p + b > 0 if w2 < 0

Since (p, q) was an arbitrary point in the half-plane, we say that the same direction of inequality holds for any other point of the half-plane.
Fact 4 proof

w2x2 + w1x1 + b = 0

We can write it as:

x2 = -(w1/w2)x1 - b/w2

[Figure: the line in the (x1, x2) plane, with the point (p, r) on the line and a point (p, s) directly above it.]

(p, r) is on the line, so:

r = -(w1/w2)p - b/w2

But s > r, so we get:

s > -(w1/w2)p - b/w2

i.e.

w2s + w1p + b > 0 if w2 > 0
w2s + w1p + b < 0 if w2 < 0

Since (p, s) was an arbitrary point in the (other) half-plane, we say that the same direction of inequality holds for any other point of that half-plane.
Corollary

• What’s an easy way to determine the direction of the inequalities for each half-plane?
– Try it for the point (0,0), and determine the direction for the half-plane where (0,0) belongs.
– The points of the other half-plane will have the opposite inequality direction.
• How much bigger (or smaller) than zero w2q + w1p + b is, is proportional to the distance of the point (p, q) from the line.
• The same can be said for an n-dimensional space. Simply, we don’t talk about “half-planes” but “half-spaces” (the line is now a hyperplane creating two half-spaces).
Linear classifier

• We can now exploit the sign of this distance to define a linear classifier, one whose decision boundary is a hyperplane.
• Instead of using 0 and 1 as the class labels (which was an arbitrary choice anyway), we use the sign of the distance, either +1 or -1, as the labels (that is, the values of the yi’s).

h(x) = sign(w·x + b)

which outputs +1 or –1.

Trick: let x0 = 1 and w0 = b. Then

w·x = Σ(j=0..n) wj xj

so the hyperplane equation is w·x = 0 (points satisfy w·x > 0 or w·x < 0 on either side), and

h(x) = sign(w·x)
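A sketch of the trick in Python; the weight values below are made up for illustration (w0 plays the role of b, and each input vector gets a constant leading 1):

```python
def h(w, x):
    # h(x) = sign(w . x), with the bias folded in as w0 and x0 = 1.
    s = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if s >= 0 else -1

w = [-2.0, 1.0, 1.0]          # w0 = b = -2, w1 = 1, w2 = 1 (illustrative)
print(h(w, [1.0, 3.0, 1.5]))  # 1: the point (3, 1.5) is on the + side
print(h(w, [1.0, 0.5, 0.5]))  # -1: the point (0.5, 0.5) is on the - side
```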
Margin

• The margin is the product of w·xi for the training point xi and the known sign of the class, yi:

margin: γi = yi(w·xi)

• γi is proportional to the perpendicular distance of point xi to the line (hyperplane).
• γi > 0: the point is correctly classified (sign of distance = yi).
• γi < 0: the point is incorrectly classified (sign of distance ≠ yi).
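The margin computation, sketched with a hypothetical separator through the origin (w = [1, -1], i.e. the line x1 = x2; the points are invented):

```python
def margin(w, x, y):
    # gamma_i = y_i * (w . x_i); positive means correctly classified.
    return y * sum(wj * xj for wj, xj in zip(w, x))

w = [1.0, -1.0]                    # hypothetical separator x1 = x2
print(margin(w, [2.0, 0.5], +1))   # 1.5  -> correctly classified
print(margin(w, [0.5, 2.0], +1))   # -1.5 -> misclassified
```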
Perceptron algorithm

• How to find a linear separator?
• The perceptron algorithm was developed by Rosenblatt in the mid-1950s.
• This is a greedy, "mistake driven" algorithm.

Algorithm
• Pick an initial weight vector (including b), e.g. [.1, …, .1]
• Repeat until all points get correctly classified:
• Repeat for each point xi:
– Calculate the margin yi(w·xi) (this is a number)
– If margin > 0, point xi is correctly classified
– Else, change the weights by an amount proportional to yi xi
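The steps above can be sketched in Python. This is a minimal illustration on a tiny made-up linearly separable dataset; each point carries a leading 1 so that w includes b:

```python
def perceptron(points, labels, max_passes=100):
    w = [0.1] * len(points[0])            # initial weight vector
    for _ in range(max_passes):
        mistakes = 0
        for x, y in zip(points, labels):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            if margin <= 0:               # misclassified (or on the line)
                w = [wj + y * xj for wj, xj in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                 # a full clean pass: done
            break
    return w                              # best effort if not separable

# Made-up data: each point is [1, x1, x2]; labels are +1 / -1.
pts = [[1, 2.0, 2.0], [1, 3.0, 1.0], [1, -1.0, -2.0], [1, -2.0, -1.0]]
ys = [+1, +1, -1, -1]
w = perceptron(pts, ys)
print(w)
```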
Gradient Ascent/Descent

• Why pick yi xi as the increment to the weights?
• The margin is a function of several input variables.
– The variables are w2, w1, w0 (or in general wn, …, w0).
• In order to reach the maximum of this function, it is good to change the variables in the direction of the slope of the function.
• The slope is represented by the gradient of the function.
– The gradient is the vector of first (partial) derivatives of the function with respect to each of the input variables.

∇w f = ∇w (yi(w·xi)) = yi xi, so the update is w ← w + yi xi
Perceptron algorithm

• Changes for the different points interfere with each other.
– So it will not be the case that one pass through the points will produce a correct weight vector.
– In general, we will have to go around multiple times.
• However, the algorithm is guaranteed to terminate with the weights of a separating hyperplane as long as the data is linearly separable.
– The proof of this fact is beyond our scope.
• Notice that if the data is not separable, then this algorithm loops forever.
– It is a good idea to keep track of the best separator we've seen so far.
Perceptron algorithm: Bankruptcy data

• It takes 49 iterations through the bankruptcy data for the algorithm to stop.
• The separator at the end of the loop is [0.4, 0.94, -2.2].
• We can pick some small "rate" constant η (eta) to scale the change to w.
Dual Form

• The calculated w will be:

w = Σ(i=1..m) αi yi xi

where αi is the number of times data instance xi got misclassified.

• So, for classification we'll check:

h(x) = sign(w·x) = sign(Σ(i=1..m) αi yi (xi·x))

where x is the new data instance to be classified.
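A sketch of dual-form classification; the alphas and training set below are invented for illustration:

```python
def h_dual(alphas, xs, ys, x):
    # h(x) = sign( sum_i alpha_i * y_i * (x_i . x) )
    s = sum(a * y * sum(p * q for p, q in zip(xi, x))
            for a, y, xi in zip(alphas, ys, xs))
    return 1 if s >= 0 else -1

xs = [[1.0, 1.0], [-1.0, -1.0]]   # two training points (made up)
ys = [+1, -1]
alphas = [1, 1]                   # each was misclassified once
print(h_dual(alphas, xs, ys, [2.0, 2.0]))    # 1
print(h_dual(alphas, xs, ys, [-1.0, -2.0]))  # -1
```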
Perceptron algorithm (dual form)

• Set α = 0.
• Repeat until all points get correctly classified:
• Repeat for each point xi:
– Calculate the margin yi Σ(j=1..m) αj yj (xj·xi)
– If margin > 0, point xi is correctly classified
– Else, increment αi.

If the data is not linearly separable, then the alphas grow without bound.
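The dual algorithm sketched in Python, on a tiny made-up separable dataset; only the alpha counts are updated, never w directly:

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def dual_perceptron(xs, ys, max_passes=100):
    alphas = [0] * len(xs)
    for _ in range(max_passes):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            m = yi * sum(a * yj * dot(xj, xi)
                         for a, yj, xj in zip(alphas, ys, xs))
            if m <= 0:            # misclassified: bump this point's count
                alphas[i] += 1
                mistakes += 1
        if mistakes == 0:
            break                 # if not separable, alphas keep growing
    return alphas

xs = [[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]]
ys = [+1, +1, -1, -1]
alphas = dual_perceptron(xs, ys)
print(alphas)
```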
Non-linearly separable
Moving points into a different space

• Square every x1 and x2 value first.
– A point that was at (-1, 2) would now be at (1, 4).
– A point that was at (0.5, 1) would now be at (0.25, 1), and so on.

It is now very easy to divide the X's from the O's.
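The squaring map from the example above:

```python
def square_map(p):
    # (x1, x2) -> (x1^2, x2^2)
    return [p[0] ** 2, p[1] ** 2]

print(square_map([-1, 2]))    # [1, 4]
print(square_map([0.5, 1]))   # [0.25, 1]
```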
Main Idea

Transform the points (vectors) into another space using some function φ, and then do linear separation in the new space, i.e. considering the vectors φ(x1), φ(x2), ..., φ(xn).
The Kernel Trick

• While you could write code to transform the data into a new space like this, it isn't usually done in practice, because finding a dividing line when working with real datasets can require casting the data into hundreds or thousands of dimensions, and this is quite impractical to implement.
• However, with any algorithm that uses dot products (including the linear classifier) you can use a technique called the kernel trick.
• The kernel trick involves replacing the dot-product function with a new function that returns what the dot product would have been if the data had first been transformed to a higher-dimensional space using some mapping function.
The Kernel Trick

Remember, all we care about is computing dot products. Observe something interesting:

• Let φ: R2 → R3 such that

φ(x) = φ([x1, x2]) = [z1, z2, z3] = [x1², √2·x1x2, x2²]

• Now, let r = [r1, r2, r3] and s = [s1, s2, s3] be two vectors in R3 corresponding to vectors a = [a1, a2] and b = [b1, b2] in R2. Then:

φ(a)·φ(b) = r·s
= r1s1 + r2s2 + r3s3
= (a1b1)² + 2a1a2b1b2 + (a2b2)²
= (a1b1 + a2b2)²
= (a·b)²
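The identity can be verified numerically (the example vectors a and b are arbitrary):

```python
import math

def phi(x):
    # phi([x1, x2]) = [x1^2, sqrt(2)*x1*x2, x2^2]
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

a, b = [1.0, 2.0], [3.0, -1.0]
lhs = dot(phi(a), phi(b))   # phi(a) . phi(b)
rhs = dot(a, b) ** 2        # (a . b)^2
print(lhs, rhs)             # equal up to rounding
```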
The Kernel Trick

• So instead of mapping the data vectors via φ and computing the modified inner product φ(a)·φ(b), we can do it in one operation, leaving the mapping completely implicit.
• Because "modified inner product" is a long name, we call it a kernel: K(a, b) = φ(a)·φ(b).

Useful Kernels
• Polynomial Kernel: K(a, b) = (a·b)²
– Visualization: http://www.youtube.com/watch?v=3liCbRZPrZA
• Gaussian Kernel: K(a, b) = e^(−(1/2)||a−b||²)
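Both kernels written as plain functions (a sketch; the Gaussian here fixes the width parameter to 1, matching the (1/2) factor above):

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def poly_kernel(a, b):
    return dot(a, b) ** 2            # (a . b)^2

def gaussian_kernel(a, b):
    d2 = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-0.5 * d2)       # e^(-(1/2)||a - b||^2)

print(poly_kernel([1, 2], [3, -1]))     # 1
print(gaussian_kernel([1, 0], [1, 0]))  # 1.0 for identical points
```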
Line Separators
It's difficult to characterize the separator that the Perceptron algorithm will come up with.
Different runs can come up with different separators.
Can we do better?
Which one to pick?

• Natural choice: pick the separator that has the maximal margin to its closest points on either side.
– This is the most conservative choice.
– Any other separator will be "closer" to one class than to the other.

Those closest points are called "support vectors".