k nearest neighbor
Classification is done by relating the unknown to the known according to some distance/similarity function
kNN stores all available cases and classifies new cases based on a similarity measure
Different names
Memory-based reasoning
Example-based reasoning
Instance-based reasoning
Case-based reasoning
Lazy learning
kNN determines the decision boundary locally. For example, 1NN assigns each document to the class of its single closest neighbor
For kNN in general, we assign each document to the majority class of its k closest neighbors, where k is a parameter
The rationale of kNN classification is the contiguity hypothesis: we expect a test document to have the same label as the training documents located in the local region surrounding it
Voronoi tessellation of a set of objects decomposes the space into Voronoi cells, where each object's cell consists of all points that are closer to that object than to any other object.
It partitions the plane into convex polygons, each containing its corresponding document.
Example (test point shown as a star): let k = 3; the star's three nearest neighbors are one circle and two X's
P(circle class | star) = 1/3
P(X class | star) = 2/3
P(diamond class | star) = 0
3NN estimate: P(circle class | star) = 1/3
1NN estimate: P(circle class | star) = 1 (the single nearest neighbor is a circle)
So 3NN prefers the X class while 1NN prefers the circle class
Advantages
Non-parametric architecture
Simple
Powerful
Requires no training time
Disadvantages
Memory intensive
Classification/estimation is slow
The distance is calculated using the Euclidean distance:
D = sqrt((x1 - x2)^2 + (y1 - y2)^2)
Attribute values are standardized with min-max scaling so that no single attribute dominates the distance:
Xs = (X - Min) / (Max - Min)
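A minimal sketch of these two formulas in Python (function names are illustrative, not from the original slides):

import math

def min_max_scale(value, lo, hi):
    # Min-max standardization: maps value into [0, 1]
    return (value - lo) / (hi - lo)

def euclidean_distance(p, q):
    # Straight-line distance between two equal-length numeric vectors
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))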
If k = 1, select the single nearest neighbor
If k > 1
For classification, select the most frequent class among the k nearest neighbors
For regression, calculate the average of the k nearest neighbors' values
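A minimal kNN sketch along these lines, assuming numeric feature vectors (the function name and the task argument are illustrative):

from collections import Counter
import math

def knn_predict(train_X, train_y, query, k, task="classification"):
    # Keep the k training points closest to the query (Euclidean distance)
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: math.dist(pair[0], query))[:k]
    labels = [label for _, label in neighbors]
    if task == "classification":
        # k = 1 reduces to the class of the single nearest neighbor
        return Counter(labels).most_common(1)[0][0]
    # Regression: average of the k neighbors' target values
    return sum(labels) / k

print(knn_predict([[1, 1], [2, 2], [9, 9]], ["a", "a", "b"], [1.5, 1.5], k=3))  # -> a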
An inductive learning task – use particular facts to make more generalized conclusions
A predictive model based on a branching series of Boolean tests – each of these Boolean tests is less complex than a one-stage classifier
It learns from class-labeled tuples
Can be used as a visual aid to structure and solve sequential problems
Each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label
If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?
[Decision tree figure: root node "Leave At" with branches for 8 AM, 9 AM and 10 AM; 8 AM → Long; 9 AM → Accident? (Yes → Long, No → Medium); 10 AM → Stall? (Yes → Long, No → Short)]
In this decision tree, we made a series of Boolean decisions and followed the corresponding branches –
Did we leave at 10AM?
Did the car stall on road?
Is there an accident on the road?
By answering each of these questions as yes or no, we can come to a conclusion on how long our commute might take
We do not have to represent this tree graphically
We can represent this as a set of rules. However, it may be harder to read
if hour == 8am
    commute time = long
else if hour == 9am
    if accident == yes
        commute time = long
    else
        commute time = medium
else if hour == 10am
    if stall == yes
        commute time = long
    else
        commute time = short
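The same rules as a small, runnable function (purely illustrative; attribute values are assumed to be lowercase strings):

def commute_time(hour, accident, stall):
    # The decision tree above, written as nested conditionals
    if hour == "8am":
        return "long"
    if hour == "9am":
        return "long" if accident == "yes" else "medium"
    if hour == "10am":
        return "long" if stall == "yes" else "short"
    return None  # no rule covers other leave times

# Leaving at 10 AM with no stalled cars on the road -> short
print(commute_time("10am", accident="no", stall="no"))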
The algorithm is called with three parameters – the data partition, the attribute list, and the attribute selection method.
The data partition D is a set of training tuples and their associated class labels.
The attribute list is a list of attributes describing the tuples.
The attribute selection method specifies a heuristic procedure for selecting the attribute that best discriminates among the tuples.
The tree starts at node N. If all the tuples in D are of the same class, then node N becomes a leaf and is labelled with that class.
Otherwise, the attribute selection method is used to determine the splitting criterion.
Node N is labelled with the splitting criterion, which serves as a test at the node.
The previous experience decision table showed 4 attributes – hour, weather, accident and stall
But the decision tree showed three attributes – hour, accident and stall
So which attribute is to be kept and which is to be removed?
The attribute selection method shows that weather is not a discriminating attribute
Method (Occam's razor) – given a number of competing hypotheses, the simplest one is preferable
We will focus on ID3 algorithm
Basic idea
Choose the best attribute to split the remaining instances and make that attribute a decision node
Repeat this process recursively for each child (a skeleton of this recursion is sketched after the stopping conditions below)
Stop when
All instances have the same target attribute value
There are no more attributes
There are no more instances
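A rough skeleton of that recursion is sketched below; instances are assumed to be dictionaries mapping attribute names to values, and best_attribute is a placeholder for the attribute selection step (for ID3, the attribute with the highest information gain, computed from the entropy formula given later in these notes):

def build_tree(instances, attributes, target):
    labels = [inst[target] for inst in instances]
    # Stop: every remaining instance has the same target value
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left to split on -> return the majority label
    if not attributes:
        return max(set(labels), key=labels.count)
    best = best_attribute(instances, attributes, target)  # e.g. highest information gain
    tree = {best: {}}
    for value in set(inst[best] for inst in instances):
        subset = [inst for inst in instances if inst[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, target)
    return tree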
ID3 chooses which attribute to split on based on entropy.
Entropy is a measure of disorder (uncertainty) in the data
Entropy is minimized when all values of target attribute are the same
If we know that the commute time will be short, the entropy=0
Entropy is maximized when there is an equal chance of values for the target attribute (i.e. result is random)
If commute time = short in 3 instances, medium in 3 instances and long in 3 instances, entropy is maximized
Calculation of entropy
Entropy(S) = Σ (i = 1 to l) −(|Si| / |S|) · log2(|Si| / |S|)
S = the set of examples
Si = the subset of S with value vi under the target attribute
l = the size of the range of the target attribute (the number of distinct target values)
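The same formula as a short Python function; the class_counts argument holds the subset sizes |Si|, and the example values are illustrative:

import math

def entropy(class_counts):
    # class_counts: sizes |Si| of the subsets of S, one per target attribute value
    total = sum(class_counts)
    return sum(-(c / total) * math.log2(c / total)
               for c in class_counts if c > 0)

print(entropy([9, 0, 0]))  # all examples share one target value -> 0.0
print(entropy([3, 3, 3]))  # 3 short, 3 medium, 3 long -> log2(3) ≈ 1.585 (maximal for 3 values)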
If we break down the leaving time to the minute, we might get something like this
Since the entropy would be very low for each branch, we would end up with n branches and n leaves. This would not be helpful for predictive modelling.
We use a technique called discretization: we choose cut points, such as 9 AM, for splitting continuous attributes.
[Figure: each leave time down to the minute (8:02 AM, 8:03 AM, 9:05 AM, 9:07 AM, 9:09 AM, 10:02 AM) gets its own branch and leaf labelled Long, Medium or Short]
Consider the leave-time attribute, with the commute-time class shown in parentheses
When we choose cut points and group the values between them, we increase the entropy of each branch slightly, but we no longer get a decision tree with as many cut points as leaves
8:00 (L), 8:02 (L), 8:07 (M), 9:00 (S), 9:20 (S), 9:25 (S), 10:00 (S), 10:02 (M)
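One common heuristic, sketched below with the example values converted to minutes after 8:00, is to place candidate cut points at the midpoints where the class label changes (the helper name is illustrative):

def candidate_cut_points(sorted_values, labels):
    # Midpoints between consecutive values whose class labels differ
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(zip(sorted_values, labels),
                                        zip(sorted_values[1:], labels[1:]))
            if la != lb]

# Leave times from the example as minutes after 8:00, with commute-time labels
minutes = [0, 2, 7, 60, 80, 85, 120, 122]
labels  = ["L", "L", "M", "S", "S", "S", "S", "M"]
print(candidate_cut_points(minutes, labels))  # [4.5, 33.5, 121.0]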
Binary decision trees
Classification of an input vector is done by traversing the tree beginning at the root node and ending at the leaf
Each node of the tree computes an inequality
Each leaf is assigned to a particular class
Splits of the input space are based on one input variable at each node
Each node draws a boundary that can be geometrically interpreted as a hyperplane perpendicular to that variable's axis
[Binary decision tree figure: each node tests an inequality such as BMI < 24, with Yes/No branches; the leaves are assigned classes such as B and C]
Linear decision trees are similar to binary decision trees
The inequality computed at each node takes a linear form that may depend on multiple input variables, e.g.
aX1 + bX2
Chi-squared Automatic Interaction Detector (CHAID)
A non-binary decision tree
The decision made at each node is based on a single variable, but can result in multiple branches
Continuous variables are grouped into a finite number of bins to create categories
Equal-population bins are created for CHAID (a rough sketch follows below)
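A rough sketch of equal-population binning (illustrative only; real CHAID implementations differ in the details):

def equal_population_bins(values, n_bins):
    # Split the sorted values into n_bins groups of (roughly) equal size
    ordered = sorted(values)
    size = len(ordered) // n_bins
    bins = [ordered[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(ordered[(n_bins - 1) * size:])  # the last bin takes any remainder
    return bins

print(equal_population_bins([7, 3, 9, 1, 5, 8, 2, 6, 4, 10], 3))
# [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]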
Classification and Regression Trees (CART) are binary decision trees which split on a single variable at each node
The CART algorithm goes through an exhaustive search of all variables and split values to find the optimal splitting rule for each node.
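For comparison, scikit-learn's DecisionTreeClassifier follows the CART approach (binary splits, one feature per node). A minimal usage sketch, with toy data adapted from the commute-time example (the numeric encoding is an assumption for illustration):

from sklearn.tree import DecisionTreeClassifier

# Toy data: [hour, accident, stall] encoded numerically; labels are commute-time classes
X = [[8, 0, 0], [9, 0, 0], [9, 1, 0], [10, 0, 0], [10, 0, 1]]
y = ["long", "medium", "long", "short", "long"]

clf = DecisionTreeClassifier(max_depth=3)  # limiting depth acts as a simple form of pre-pruning
clf.fit(X, y)
print(clf.predict([[10, 0, 0]]))           # -> ['short']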
There is another technique for reducing the number of attributes used in a tree – pruning
Two types of pruning
Pre-pruning (forward pruning)
Post-pruning (backward pruning)
Pre-pruning
We decide during the building process, when to stop adding attributes (possibly based on their information gain)
However, this may be problematic – why?
Sometimes attributes individually do not contribute much to a decision, but combined they may have a significant impact.
Post-pruning waits until full decision tree has been built and then prunes the attributes.
Two techniques:
Subtree replacement
Subtree raising
Subtree replacement
[Figure: before pruning, root A has child B; B has a subtree rooted at C with leaves 1, 2 and 3, plus leaves 4 and 5]
A single leaf node (6) replaces the subtree rooted at C
This may increase accuracy
[Figure: after pruning, root A has child B with leaves 6, 4 and 5]
Subtree raising
The entire subtree is raised onto another node
[Figure: before raising, root A has child B; B has a subtree rooted at C with leaves 1, 2 and 3, plus leaves 4 and 5]
[Figure: after raising, the subtree rooted at C replaces B, so root A has child C with leaves 1, 2 and 3]
While a decision tree classifies quickly, the time taken to build the tree may be higher than for many other types of classifiers.
Decision trees suffer from the problem of error propagation throughout the tree.
Since decision trees work by a series of local decisions, what happens if one of these decisions is wrong?
Every decision from that point on may be wrong.
We may never return to the correct path of the tree.