Nearest Neighbour and Clustering


TRANSCRIPT

  • 8/3/2019 Nearest Neighbour and Clustering

    1/122

    Nearest Neighbor and Clustering

  • 8/3/2019 Nearest Neighbour and Clustering

    2/122

  • 8/3/2019 Nearest Neighbour and Clustering

    3/122

    Nearest Neighbour and Clustering

    Nearest Neighbor: used for prediction as well as consolidation.
    Clustering: used mostly for consolidating data into a high-level view and a general grouping of records into like behaviors.

    Nearest Neighbor: the space is defined by the problem to be solved (supervised learning).
    Clustering: the space is defined as a default n-dimensional space, or is defined by the user, or is a predefined space driven by past experience (unsupervised learning).

    Nearest Neighbor: generally only uses distance metrics to determine nearness.
    Clustering: can use other metrics besides distance to determine the nearness of two records - for example, linking two points together.

  • 8/3/2019 Nearest Neighbour and Clustering

    4/122

    K Nearest Neighbors

    Advantages:
    Nonparametric architecture
    Simple
    Powerful
    Requires no training time

    Disadvantages:
    Memory intensive
    Classification/estimation is slow

  • 8/3/2019 Nearest Neighbour and Clustering

    5/122

    K Nearest Neighbors

    The key issues involved in training this model include setting:

    the variable K - via validation techniques (e.g. cross-validation)

    the type of distance metric - e.g. the Euclidean measure:

    $\mathrm{Dist}(X, Y) = \sqrt{\sum_{i=1}^{D} (X_i - Y_i)^2}$

  • 8/3/2019 Nearest Neighbour and Clustering

    6/122

    Figure: K Nearest Neighbors example
    Stored training-set patterns, with X the input pattern for classification;
    dashed lines (---) show the Euclidean distance measure to the nearest three patterns.

  • 8/3/2019 Nearest Neighbour and Clustering

    7/122

    Store all input data in the training set

    For each pattern in the test set

    Search for the K nearest patterns to the

    input pattern using a Euclidean distance

    measure

    For classification, compute the confidence for

    each class as Ci/K,

    (where Ci is the number of patterns among

    the K nearest patterns belonging to class i.)

    The classification for the input pattern is the

    class with the highest confidence.
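
    A minimal Python sketch of the procedure above (NumPy is assumed; the names train_X, train_y and the helper knn_classify are illustrative placeholders, not from the slides):

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x, k):
    """Classify one input pattern x against a stored training set.

    train_X: (n, d) array of stored patterns, train_y: length-n labels.
    Returns the predicted class and its confidence Ci/K.
    """
    # Euclidean distance from x to every stored training pattern
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]            # indices of the K nearest patterns
    votes = Counter(train_y[i] for i in nearest)
    best_class, ci = votes.most_common(1)[0]   # class with the most votes
    return best_class, ci / k                  # confidence = Ci / K
```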

  • 8/3/2019 Nearest Neighbour and Clustering

    8/122

    Training parameters and typical settings: Number of nearest neighbors

    The number of nearest neighbors (K) should be based on cross-validation over a number of K settings.

    K = 1 is a good baseline model to benchmark against.

    A good rule of thumb is that K should be less than the square root of the total number of training patterns.
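
    One hedged way to pick K along these lines, using scikit-learn's cross-validation (scikit-learn and the data names X_train, y_train are assumptions, not something the slides prescribe):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X, y, max_k):
    """Return the K with the best 5-fold cross-validated accuracy, trying 1..max_k."""
    scores = {}
    for k in range(1, max_k + 1):
        clf = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(clf, X, y, cv=5).mean()
    return max(scores, key=scores.get)

# Rule of thumb from the slide: only try K up to sqrt(number of training patterns),
# and keep K = 1 as the baseline to benchmark against.
# best_k = choose_k(X_train, y_train, max_k=int(np.sqrt(len(X_train))))
```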

  • 8/3/2019 Nearest Neighbour and Clustering

    9/122

    Training parameters and typical settings: Input compression

    Since KNN is very storage intensive, we may want to compress data patterns as a preprocessing step before classification.

    Using input compression will usually result in slightly worse performance.

    Sometimes using compression will improve performance, because it performs automatic normalization of the data, which can equalize the effect of each input in the Euclidean distance measure.

  • 8/3/2019 Nearest Neighbour and Clustering

    10/122

    Nearest Neighbour and Clustering

    Among the oldest techniques used in data mining (DM).

    Like records are grouped or clustered together and put into the same grouping.

    The nearest neighbor prediction technique is quite close to clustering: to find the prediction value for one record, look for similar records with similar predictor values in the historical DB.

  • 8/3/2019 Nearest Neighbour and Clustering

    11/122

    Use the prediction value of the record which is nearest to the unknown record.

    Example: a laundry uses clustering.

    In business, clusters are more dynamic: which cluster a record falls into may change daily or monthly, and is therefore difficult to decide.

    Another NN example: the income group of one's neighbours.

  • 8/3/2019 Nearest Neighbour and Clustering

    12/122

    The best way to predict an unknown person's income is possibly to choose the closest persons.

    The nearest neighbour prediction algorithm works on a DB in very much the same way.

    Many factors go into the "nearest" condition: the person's location, school attended, degree attained, etc.

  • 8/3/2019 Nearest Neighbour and Clustering

    13/122

    Business Score Card

    Measures critical to business success deal with ease of deployment and real-world problems: avoiding serious mistakes as well as achieving big successes.

    A DM technique needs to be easy to use and deployable in as automated a fashion as possible.

    It should provide clear, understandable answers, and answers that can be converted into ROI.

  • 8/3/2019 Nearest Neighbour and Clustering

    14/122

    BSC

    Automation: NN methods are relatively automated, although some preprocessing is performed in converting predictors into values that can be used in a measure of distance.

    Unordered categorical predictors (e.g. eye color) need to be defined in terms of their distance from each other when there is a match (e.g. whether blue is close to brown).

  • 8/3/2019 Nearest Neighbour and Clustering

    15/122

    Clarity: excellent for clear explanation of why a prediction was made. A single example or a set of examples can be extracted from the historical DB as evidence for why a prediction should or should not be made.

    The system can also communicate when it is not confident of its prediction.

  • 8/3/2019 Nearest Neighbour and Clustering

    16/122

    ROI: Since the individual records of the

    nearest neighbor are returned directly

    without altering the DB, it is possible to

    understand all facets of business behavior

    and thus derive a more complete estimate

    of the ROI not just from the prediction but

    from a variety of different factors

  • 8/3/2019 Nearest Neighbour and Clustering

    17/122

    Where to use clustering and nearest neighbor prediction

    Applications range from personal bankruptcy to computer recognition of human handwriting.

    Clustering for clarity: in clustering, like records are grouped together, giving a high-level view of what is going on in the DB.

    Clustering as segmentation gives a bird's-eye view of the business. Commercial offerings: PRIZM & Microvision.

  • 8/3/2019 Nearest Neighbour and Clustering

    18/122

    These offerings grouped the population by demographic information into segments.

    The clustering information is then used by the end user to tag the customers in the DB.

    The business user gets a high-level view of what is happening within each cluster.

    Once they have worked with these clusters, users will know more about customer reactions.

  • 8/3/2019 Nearest Neighbour and Clustering

    19/122

    Clustering for outlier analysis

    Clustering is done to an extent where some records stick out.

    Example: profit across stores or departments.

  • 8/3/2019 Nearest Neighbour and Clustering

    20/122

    Nearest Neighbour for Prediction

    One particular object can be closer to another object than to a third object.

    People have an innate sense of ordering on a variety of objects:

    an apple is closer to an orange than to a tomato;

    a Toyota Corolla is closer to a Honda Civic than to a Porsche.

    This sense of ordering places objects in time and space and makes sense in the real world.

  • 8/3/2019 Nearest Neighbour and Clustering

    21/122

    This definition of nearness, which seems to be ubiquitous, also allows us to make predictions.

    The NN prediction algorithm is simply stated as: objects that are near to each other will have similar prediction values as well.

    Thus if you know the prediction value of one of the objects, you can predict it for its nearest neighbors.

  • 8/3/2019 Nearest Neighbour and Clustering

    22/122

    Classic NN example: text retrieval

    Define a document, then look for more such documents.

    NN looks for important characteristics shared with those documents which have been marked as interesting.

    It can be used in a wide variety of places.

    Successful use depends on preformatting of the data, so that nearness can be calculated and individual records can be defined.

  • 8/3/2019 Nearest Neighbour and Clustering

    23/122

    This is easy for text retrieval, but not for time-series data such as stock prices, where there is no inherent ordering of records.

  • 8/3/2019 Nearest Neighbour and Clustering

    24/122

    Application Score Card

    Rules are seldom used for prediction here

    Used for unsupervised learning

    Clusters: the underlying prediction method for nearest neighbor technology is nearness in some feature space. This is the same underlying metric used for most clustering algorithms, although for nearest neighbor the feature space is shaped in such a way as to facilitate a particular prediction.

  • 8/3/2019 Nearest Neighbour and Clustering

    25/122

    Links: NN techniques can be used for link analysis as long as the data is preformatted so that the predictor values to be linked fall within the same record.

    Outliers: NN techniques are particularly good at detecting outliers, since they have effectively created a space within which it is possible to determine when a record is out of place.

  • 8/3/2019 Nearest Neighbour and Clustering

    26/122

    Rules: one strength of NN techniques is that they take into account all the predictors to some degree, which is helpful for prediction but makes for a complex model that cannot easily be described as a rule. The systems are also generally optimized for prediction of new records rather than exhaustive extraction of interesting rules from the DB.

  • 8/3/2019 Nearest Neighbour and Clustering

    27/122

    Sequences: NN techniques have been successfully used to make predictions in time sequences. The time values need to be encoded in the records.

    Text: most text retrieval systems are based around NN technology, and most of the remaining breakthroughs come from further refinements of the predictor-weighting algorithms and the distance calculations.

  • 8/3/2019 Nearest Neighbour and Clustering

    28/122

    General Idea

    NN is a refinement of clustering in the sense that both use distance in some feature space to create either structure in the data or predictions.

    NN is a way of automatically determining the weighting of the importance of the predictors and how the distance will be measured within the feature space.

  • 8/3/2019 Nearest Neighbour and Clustering

    29/122

    Clustering is one special case: the importance of each predictor is considered to be equivalent.

    Example: a set of people and clustering them into groups of friends.

  • 8/3/2019 Nearest Neighbour and Clustering

    30/122

    There is no best way to cluster

    Is clustering on financial status better than on eye color or on food habits?

    If clustering is done with no specific purpose, just to group data, probably all of these are OK.

    The reasons for clustering are ill defined, as clusters are used more often for exploration and summarization than for prediction.

  • 8/3/2019 Nearest Neighbour and Clustering

    31/122

    How are tradeoffs made when determining which records fall into which clusters?

    Example: aged vs. young, classical vs. rock.

    When clustering a large number of records, these tradeoffs are explicitly defined by the clustering algorithm.

  • 8/3/2019 Nearest Neighbour and Clustering

    32/122

    Difference between clustering and NN

    Main distinction: clustering is an unsupervised learning technique, while NN for prediction is a supervised learning technique.

    Unsupervised: there is no particular

    reason for the creation of the models

    Supervised: prediction

    Prediction: patterns presented are most

    important

  • 8/3/2019 Nearest Neighbour and Clustering

    33/122

  • 8/3/2019 Nearest Neighbour and Clustering

    34/122

    How is the space for clustering and

    nearest neighbor defined?

    Clustering: an n-dimensional space is formed by assigning one predictor to each dimension.

    NN: predictors are also mapped to dimensions, but those dimensions are literally stretched or compressed according to how important the particular predictor is in making the prediction.

    Stretching a dimension makes it more important than the others.

  • 8/3/2019 Nearest Neighbour and Clustering

    35/122

  • 8/3/2019 Nearest Neighbour and Clustering

    36/122

    The distance between a cluster and a given data point is measured from the centre of mass of the cluster.

    The centre of mass of the cluster can be calculated as the average of the predictor values.

    Clusters are defined either solely by their centre, or by their centre with some radius attached, in which case all points that fall within the radius are classified into that cluster.

  • 8/3/2019 Nearest Neighbour and Clustering

    37/122

    The centre record is a prototypical record.

    Normal DB records are mapped onto an n-dimensional space.

    2 or 3 dimensions are easy to visualize; more dimensions become complex.

  • 8/3/2019 Nearest Neighbour and Clustering

    38/122

    How is nearness defined?

    Clustering and NN both work with an n-dimensional space, with one record being close to or far from another record.

    Nearness could be determined as follows: any record in the historical DB that is exactly the same as the record to be predicted is considered close, and anything else is far away.

  • 8/3/2019 Nearest Neighbour and Clustering

    39/122

    Difficulty with this strategy:

    exact matches of records are unlikely in the DB, and a perfectly matching record may be spurious.

    Better results are obtained by taking a vote among several nearby records.

  • 8/3/2019 Nearest Neighbour and Clustering

    40/122

    Two other distances:

    Manhattan distance: adds up the differences between each predictor of the historical record and the record to be predicted.

    Euclidean distance (Pythagorean): the distance between two points in n dimensions, obtained by squaring the differences of the predictor values for the two records and taking the square root of the sum.
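
    The two distances written out as a small sketch (pure Python; representing each record as an equal-length list of numeric predictor values is an assumption):

```python
import math

def manhattan(rec_a, rec_b):
    # Sum of the absolute differences between each predictor
    return sum(abs(a - b) for a, b in zip(rec_a, rec_b))

def euclidean(rec_a, rec_b):
    # Square the differences, sum them, take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rec_a, rec_b)))
```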

  • 8/3/2019 Nearest Neighbour and Clustering

    41/122

    Distance between records xyz & abc:

    Age: 6
    Salary: 3100
    Color of eye: 0
    Gender: 1
    Income: 1 (high = 3, med = 2, low = 1)

    Total difference = 3108

  • 8/3/2019 Nearest Neighbour and Clustering

    42/122

    The difference is dominated by salary; whether the other predictors match or not does not matter.

    To balance this, use normalized values (0 to 100).

    The maximum salary difference in the data set is 16543; between xyz and abc it is 3100, which is 19% of the maximum, so the total becomes 6 + 19 + 0 + 100 + 100 = 225.


  • 8/3/2019 Nearest Neighbour and Clustering

    43/122

    Weighting the dimensions: distance with a purpose

    When a high-income record (e.g. Mukesh Ambani) is added, an outlier is created when clustering is done on age and income.

    Normalizing does not help in this case.

    When "near" is defined, how important is each dimension's contribution?

    Answer: it depends on what is to be accomplished.

  • 8/3/2019 Nearest Neighbour and Clustering

    44/122

    Calculating dimension weights

    There are several automatic ways of calculating the importance of different dimensions.

    Example: in document classification and prediction, the dimensions of the space are often the individual words contained in the document.

    Example: whether "entrepreneur" occurs or not.

    "The" occurs several times, so it is of little significance; the earlier word is significant.

  • 8/3/2019 Nearest Neighbour and Clustering

    45/122

    Weights:

    1. Inverse frequency is often used: "the" occurred in 10000 docs, so its word weight = 1/10000 = 0.0001; "entrepreneur" occurred in 100 docs, so 1/100 = 0.01.

    2. Importance of the word for the topic to be predicted: if the topic is starting a small business, words such as "entrepreneur" and "venture capital" will be given higher weights.
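
    A small illustration of the inverse-frequency weighting described above (the document counts mirror the slide; the function itself is just a sketch):

```python
def inverse_frequency_weight(doc_count):
    # A word that appears in many documents gets a small weight
    return 1.0 / doc_count

print(inverse_frequency_weight(10000))  # 'the'          -> 0.0001
print(inverse_frequency_weight(100))    # 'entrepreneur' -> 0.01
```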

  • 8/3/2019 Nearest Neighbour and Clustering

    46/122

    Data mining on documents is a special situation: many dimensions, and all dimensions are binary.

    Other business problems have binary (gender), categorical (eye color), and numeric (revenue) dimensions.

    Each dimension is weighted depending on its relevance to the topic to be predicted.

    Calculation: the correlation between the predictor and the prediction value.

  • 8/3/2019 Nearest Neighbour and Clustering

    47/122

    Or the conditional probability that the prediction has a certain value given that the predictor has a certain value.

    Dimension weights can also be calculated via algorithmic search: random weights are tried initially, then slowly modified to improve the accuracy of the system.

  • 8/3/2019 Nearest Neighbour and Clustering

    48/122

    Hierarchical and nonhierarchical

    clustering

    Hierarchical clustering (HC) builds clusters from small to big; it is unsupervised learning.

    Fewer or greater numbers of clusters may be desired; choose the number of clusters depending on the application.

    Extreme case: as many clusters as there are records. In this case records are optimally similar to each other within a cluster (there is only one per cluster) but different from other clusters.

  • 8/3/2019 Nearest Neighbour and Clustering

    49/122

    Such clustering probably cannot find useful patterns: there is no summary information, and the data is not understood any better.

    Fewer clusters than the original number of records is better.

    Advantage of HC: it allows end users to choose from either many clusters or only a few.

  • 8/3/2019 Nearest Neighbour and Clustering

    50/122

    HC can be viewed as a tree: smaller clusters merge together to create the next highest level of clusters, which again merge at that level, and so on.

    The user can decide what number of clusters adequately summarizes the data while providing useful information.

    A single cluster gives great summarization but does not provide any specific information.

  • 8/3/2019 Nearest Neighbour and Clustering

    51/122

    Two algorithmic approaches to HC:

    1. Agglomerative: AC techniques start with as many clusters as there are records, each cluster containing one record. The clusters that are nearest to each other are merged, and this is continued until we have a single cluster containing all records at the top of the hierarchy.

  • 8/3/2019 Nearest Neighbour and Clustering

    52/122

    2. Divisive: DC techniques take the opposite approach.

    Start with all records in one cluster, split it into smaller pieces, and then try to split those further.

  • 8/3/2019 Nearest Neighbour and Clustering

    53/122

    Non-hierarchical clustering is faster to create from the historical DB.

    The user makes a decision about the number of clusters desired or the nearness required, and the algorithm may be run multiple times.

    Either start with an arbitrary clustering and iteratively improve it by shuffling records, or create clusters by taking one record at a time depending on the criteria.

  • 8/3/2019 Nearest Neighbour and Clustering

    54/122

    Nonhierarchical clustering

    Two kinds of NHC:

    1. Single-pass methods: the DB is passed through only once in order to create the clusters.

    2. Reallocation methods: records are moved or reallocated from one cluster to another to create better clusters, with multiple passes through the DB. Still faster compared to HC.

  • 8/3/2019 Nearest Neighbour and Clustering

    55/122

    Algorithm for the single-pass technique:

    Read in a record from the DB and determine the cluster it best fits (by the measure of nearness).

    If even the nearest cluster is still far away, start a new cluster with this record.

    Read the next record.
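
    A minimal single-pass sketch along these lines (Euclidean nearness to a cluster centre and a fixed threshold are assumptions, since the slide leaves the nearness measure open):

```python
import numpy as np

def single_pass_cluster(records, threshold):
    """One pass over the records; each cluster is kept as [centre, members]."""
    clusters = []
    for rec in records:
        rec = np.asarray(rec, dtype=float)
        if clusters:
            dists = [np.linalg.norm(rec - c[0]) for c in clusters]
            best = int(np.argmin(dists))
            if dists[best] <= threshold:          # fits an existing cluster
                clusters[best][1].append(rec)
                clusters[best][0] = np.mean(clusters[best][1], axis=0)
                continue
        clusters.append([rec, [rec]])             # nearest is still far away: new cluster
    return clusters
```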

  • 8/3/2019 Nearest Neighbour and Clustering

    56/122

    Reading records is expensive, so single pass scores better.

    Problem: large clusters form because decisions are made early, and the sequence in which records are processed matters.

    Reallocation solves this problem by readjusting the clusters, optimizing similarity.

  • 8/3/2019 Nearest Neighbour and Clustering

    57/122

    Algorithm for reallocation:

    1. Preset the number of clusters desired.

    2. Randomly pick a record to become the centre or seed for each of these clusters.

    3. Go through the DB and assign each record to the nearest cluster.

    4. Recalculate the centres of the clusters.

    Repeat steps 3 & 4 until there is a minimum of reallocation between clusters.
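
    The reallocation procedure above is essentially what k-means does; a compact sketch, assuming Euclidean distance and NumPy arrays:

```python
import numpy as np

def reallocation_cluster(records, n_clusters, max_iter=100, seed=0):
    """Steps 1-4 of the slide, iterated until the assignments settle."""
    X = np.asarray(records, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick records to act as the seed centre of each cluster
    centres = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each record to the nearest cluster centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate the centres (keep the old centre if a cluster is empty)
        new_centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centres[k] for k in range(n_clusters)])
        if np.allclose(new_centres, centres):   # minimal reallocation: stop
            break
        centres = new_centres
    return labels, centres
```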

  • 8/3/2019 Nearest Neighbour and Clustering

    58/122

    The records initially assigned may not be good fits.

    By recalculating the centres, clusters that actually match better are formed.

    The centre moves towards high density and away from outliers.

    Predefining the number of clusters may be a worse idea than letting it be driven by the data.

  • 8/3/2019 Nearest Neighbour and Clustering

    59/122

    There is no one right answer as to how clustering is to be done.

  • 8/3/2019 Nearest Neighbour and Clustering

    60/122

    HC

    HC has an advantage over NHC: clusters are defined solely by the data (no predetermined number).

    The number of clusters can be increased or decreased by moving down or up the hierarchy.

    The hierarchy can be built either from the top, dividing further, or from the bottom, merging records at every level.

  • 8/3/2019 Nearest Neighbour and Clustering

    61/122

    Merging or splitting is usually done two clusters at a time.

    Agglomerative algorithm:

    Start with as many clusters as there are records, with one record in each cluster.

    Combine the two nearest clusters into a larger cluster.

    Continue until only one cluster remains.
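
    A naive sketch of the agglomerative loop just listed (one record per starting cluster, merging the two nearest clusters by centroid distance; the centroid choice is an assumption, since the later slides list several linkage options):

```python
import numpy as np

def agglomerate(records, n_clusters_wanted=1):
    # Start with as many clusters as there are records, one record in each
    clusters = [[np.asarray(r, dtype=float)] for r in records]
    while len(clusters) > n_clusters_wanted:
        centroids = [np.mean(c, axis=0) for c in clusters]
        # Find the two nearest clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(centroids[i] - centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Combine them into a larger cluster and continue
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```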

  • 8/3/2019 Nearest Neighbour and Clustering

    62/122

    Divisive technique algorithm:

    Start with one cluster that contains all the records in the DB.

    Determine the division of the existing cluster that best maximizes similarity within clusters and dissimilarity between clusters.

    Divide the cluster and repeat on the two smaller clusters.

    Stop when some minimum threshold of cluster size or total number has been reached, or when there is only one record per cluster.

  • 8/3/2019 Nearest Neighbour and Clustering

    63/122

    Divisive techniques are quite expensive to compute: they separate a cluster into every possible smaller cluster and pick the best one (minimum average distance).

    Agglomerative techniques are therefore preferred; there, decisions are made about which clusters to merge.

  • 8/3/2019 Nearest Neighbour and Clustering

    64/122

    Join the clusters whose resulting merged cluster has the minimum total distance between all records: Ward's method. It produces a symmetric hierarchy and is good at recovering cluster structure, but it is sensitive to outliers and has difficulty recovering elongated structures.

  • 8/3/2019 Nearest Neighbour and Clustering

    65/122

    These merge decisions can be made in several ways:

    Join the clusters whose nearest records are as near as possible: the single-link method. Since clusters can be joined on a single nearest pair of records, this technique can create long, snake-like clusters and is not good at extracting classical spherical, compact clusters.

  • 8/3/2019 Nearest Neighbour and Clustering

    66/122

    Join the clusters whose most distant records are as near as possible: the complete-link method. All records are linked within some maximum distance; it favours compact clusters.

    Join the clusters where the average distance between all pairs of records is as small as possible: the group-average-link method. It includes both nearest and most distant records, so the resulting clusters range from elongated single-link to tight complete-link clusters.
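
    These linkage choices map directly onto SciPy's hierarchical clustering, which can serve as a quick way to try them (SciPy and the toy data are assumptions; 'single', 'complete', 'average' and 'ward' correspond to single-link, complete-link, group-average-link and Ward's method):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.rand(50, 4)                            # toy records with 4 predictors
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                    # build the merge hierarchy
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```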

  • 8/3/2019 Nearest Neighbour and Clustering

    67/122

    Implementation of the KNN method for object recognition.

  • 8/3/2019 Nearest Neighbour and Clustering

    68/122

    Outline

    Introduction.

    Description of the problem.

    Description of the method.

    Image library.

    Process of identification.

    Example.

    Future work.

  • 8/3/2019 Nearest Neighbour and Clustering

    69/122

    Introduction

    Generally speaking, the problem of object recognition is how to teach a computer to recognize different objects in a picture. This is a nontrivial problem. Some of the main difficulties in solving it are separating an object from the background (especially in the presence of clutter or occlusions in the background) and the ability to recognize an object under different lighting.

  • 8/3/2019 Nearest Neighbour and Clustering

    70/122

    Introduction.

    In this research I am trying to improve the accuracy of object recognition by implementing the KNN method with a new weighted Hamming-Levenshtein distance that I developed.

  • 8/3/2019 Nearest Neighbour and Clustering

    71/122

    Description of the problem. The problem of object recognition can

    be divided into two parts:

    1) Location of an object on the picture;

    2) Identification of an object.

    For example, assume that we have the

    following picture:

  • 8/3/2019 Nearest Neighbour and Clustering

    72/122

    Description of the problem.

  • 8/3/2019 Nearest Neighbour and Clustering

    73/122

    Description of the problem. and we have the following library of

    images that we will use for object

    identification:

  • 8/3/2019 Nearest Neighbour and Clustering

    74/122

    Description of the problem. Our goal is to identify and locate objects from our library in the picture.

  • 8/3/2019 Nearest Neighbour and Clustering

    75/122

    Description of the problem. In this research I have developed a method of object identification assuming that we already know the location of an object; I am going to develop the location method in my future work.

  • 8/3/2019 Nearest Neighbour and Clustering

    76/122

    Description of the method. We will use the KNN method to identify objects.

    For example, assume we need to identify an object X on a given picture. Let us consider the space of pictures generated by the image of X and the images from our library.

  • 8/3/2019 Nearest Neighbour and Clustering

    77/122

    Description of the method. In this space we pick, say, the 5 images closest to X, and identify X by finding the plurality class of the nearest pictures.

    [Figure: points X, A1-A3, B1-B3, C1, C2 in the picture space; nearest neighbors: A1, B1, A2, B2, A3.]

  • 8/3/2019 Nearest Neighbour and Clustering

    78/122

    Description of the method. In order to use the KNN method we need to introduce a measure of similarity between two pictures.

    First of all, in order to say something about the similarity between pictures, we need to get some idea of the shape of the objects in these pictures. To do this we use an edge-detection method (the Sobel method, for example).

  • 8/3/2019 Nearest Neighbour and Clustering

    79/122

  • 8/3/2019 Nearest Neighbour and Clustering

    80/122

  • 8/3/2019 Nearest Neighbour and Clustering

    81/122

    Description of the method. Next, we turn the edge-detected picture into a bit array by thresholding intensities to 0 or 1. In fact, we are going to keep the images in the library in this form.

  • 8/3/2019 Nearest Neighbour and Clustering

    82/122

  • 8/3/2019 Nearest Neighbour and Clustering

    83/122

    Description of the method. Now, in order to compare two pictures, we need to compare two 2-dimensional bit arrays.

    It may seem natural to use the traditional Hamming distance for bitstrings, which is defined as follows: given two bitstrings of the same dimension, the Hamming distance is the minimum number of symbol changes needed to change one bitmap into the other.

  • 8/3/2019 Nearest Neighbour and Clustering

    84/122

    Description of the method. For example, the Hamming distance between

    (A) 10001001 and
    (B) 11100000 is 4.

    Notice that the Hamming distance between

    (A) 10001001 and
    (C) 10010010 is also 4, but intuitively one can regard (C) as a better match for (A) than (B).
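
    Reproducing the comparison in code (a plain Hamming distance over equal-length bitstrings; the string representation is just for illustration):

```python
def hamming(a, b):
    # Number of positions at which two equal-length bitstrings differ
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming("10001001", "11100000"))  # (A) vs (B) -> 4
print(hamming("10001001", "10010010"))  # (A) vs (C) -> 4 as well
```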

  • 8/3/2019 Nearest Neighbour and Clustering

    85/122

    Description of the method. We can modify the Hamming distance using the idea of the Levenshtein distance, which is usually used for comparing text strings and is obtained by finding the cheapest way to transform one string into another. The transformations are the one-step operations of insertion, deletion and substitution, and each transformation has a certain cost.
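
    A standard dynamic-programming sketch of the Levenshtein distance with unit costs (the description allows arbitrary per-operation costs; unit costs are an assumption for brevity):

```python
def levenshtein(a, b):
    # dp[i][j] = cheapest way to turn a[:i] into b[:j] using
    # insertion, deletion and substitution, each at cost 1
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(a)][len(b)]

print(levenshtein("10001001", "10010010"))
```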

  • 8/3/2019 Nearest Neighbour and Clustering

    86/122

    Description of the method. Also, since different parts of images have different levels of importance in the process of recognition, we can assign a weight value to each pixel of an image and use it in the definition of a distance. For example, we can eliminate the background of a picture by assigning zero weight to the corresponding pixels.

  • 8/3/2019 Nearest Neighbour and Clustering

    87/122

  • 8/3/2019 Nearest Neighbour and Clustering

    88/122

  • 8/3/2019 Nearest Neighbour and Clustering

    89/122

    Description of the method. To get the weighted Hamming-Levenshtein distance between two pictures, we divide each bitstring into several substrings of the same length, compare corresponding substrings using the Levenshtein distance, and sum all these distances multiplied by the average weight of each substring.
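
    The slides do not give the exact formulation of the author's weighted Hamming-Levenshtein distance, so the following is only one plausible reading of the description above (fixed-length substrings, per-substring Levenshtein distance, average pixel weight as multiplier). It assumes the levenshtein sketch above is in scope, and the chunk length of 16 is arbitrary:

```python
def weighted_hl_distance(bits_a, bits_b, weights, chunk=16):
    """Split the bitstrings into substrings of the same length, compare
    corresponding substrings with Levenshtein distance, and sum those
    distances weighted by the average weight of each substring."""
    assert len(bits_a) == len(bits_b) == len(weights)
    total = 0.0
    for start in range(0, len(bits_a), chunk):
        sub_a = bits_a[start:start + chunk]
        sub_b = bits_b[start:start + chunk]
        w = sum(weights[start:start + chunk]) / len(sub_a)  # average weight
        total += w * levenshtein(sub_a, sub_b)
    return total
```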

  • 8/3/2019 Nearest Neighbour and Clustering

    90/122

    Image library. Each object in the library is represented by several images taken under different lighting and from different sides. Each image in the library is represented by two 2-dimensional arrays: the first array contains the edge-detected picture turned into a bit array, and the second one contains the weight values assigned to each pixel.

  • 8/3/2019 Nearest Neighbour and Clustering

    91/122

    Process of identification. To identify an object, we turn its edge-detected image into a bit array by thresholding intensities to 0 or 1.

    Then we measure the distance between this image and each image from our library, using the corresponding weight arrays and the weighted Hamming-Levenshtein distance.

    Using the KNN method we identify the object.

  • 8/3/2019 Nearest Neighbour and Clustering

    92/122

    Example. Below, some object-identification results obtained using the method described above are presented.

  • 8/3/2019 Nearest Neighbour and Clustering

    93/122

    Example. Assume that we have the image library

    with the following edge-detected images

    of objects and weighted images.

  • 8/3/2019 Nearest Neighbour and Clustering

    94/122

  • 8/3/2019 Nearest Neighbour and Clustering

    95/122

  • 8/3/2019 Nearest Neighbour and Clustering

    96/122

  • 8/3/2019 Nearest Neighbour and Clustering

    97/122

    Example. Let us try to identify the following picture.

    Picture 1

  • 8/3/2019 Nearest Neighbour and Clustering

    98/122

    Example. We compare this picture with each

    image in our library, and we get the

    following table of distances.

  • 8/3/2019 Nearest Neighbour and Clustering

    99/122

    Example. If we select the three closest neighbors of Picture 1, then we can identify it as a Bear.

    Distances to Picture 1:

    Bear 1    876
    Bear 2  21009
    Bear 3  24495
    Cat 1   27401
    Cat 2   25986
    Cat 3   24538
    Dog 1   21629
    Dog 2   26809
    Dog 3   25546

  • 8/3/2019 Nearest Neighbour and Clustering

    100/122

    Example. Let us do similar calculations for these

    two pictures:

    Picture 2. Picture 3.

  • 8/3/2019 Nearest Neighbour and Clustering

    101/122

              Picture 2   Picture 3
    Bear 1      31678       32629
    Bear 2      24644       23790
    Bear 3      31662       32150
    Cat 1        1864       28687
    Cat 2       22798       25655
    Cat 3       22242       25824
    Dog 1       23087        1577
    Dog 2       25679       24042
    Dog 3       25785       23880

  • 8/3/2019 Nearest Neighbour and Clustering

    102/122

    Future work.

    Develop a method for locating an object in the picture.

    Develop an idea of a reasonable weight distribution for the images in the library.

    Improve the identification algorithm to allow comparison of pictures of different sizes.

    Continue to work on improving the definition of the weighted Hamming-Levenshtein distance.


  • 8/3/2019 Nearest Neighbour and Clustering

    103/122

    Introduction: Optical Character Recognition (OCR)

    Predict the label of each image using the classification function learned from training.

    OCR is basically a classification task on multivariate data:

    Pixel values -> variables
    Each type of character -> class

    Objective: to recognise images of handwritten digits based on classification methods for multivariate data.

  • 8/3/2019 Nearest Neighbour and Clustering

    104/122

    Handwritten Digit data

    16 x 16 (= 256 pixel) grey-scale images of digits in the range 0-9:
    Xi = [xi1, xi2, ..., xi256], with xij in [0, 1] and yi in {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

    9298 labelled samples; training set ~ 1000 images; test set randomly selected from the full database.

    Basic idea: correctly identify the digit given an image.

    [Figure: a 16 x 16 grey-scale digit image.]

  • 8/3/2019 Nearest Neighbour and Clustering

    105/122

    Dimension reduction - PCA

    PCA is done on the mean-centered images.

    The eigenvectors of the 256 x 256 matrix are called the Eigen digits (256-dimensional).

    The larger an Eigen value, the more important is that Eigen digit.

    The i-th PC of an image X is yi = ei . X

    [Figures: the average image / average digit, and a grid of Eigen digits.]
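
    A hedged NumPy sketch of the eigendigit computation just described (the data matrix X of shape (n_images, 256) is a placeholder; eigendecomposition of the sample covariance is one standard way to obtain the principal components, though the slides do not state which routine was used):

```python
import numpy as np

def eigendigits(X):
    """X: (n_images, 256) array of flattened 16 x 16 digit images."""
    mean_image = X.mean(axis=0)
    centered = X - mean_image                 # PCA on the mean-centered images
    cov = np.cov(centered, rowvar=False)      # 256 x 256 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvectors = Eigen digits
    order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
    return mean_image, eigvals[order], eigvecs[:, order]

# The i-th principal component of an image x is then y_i = e_i . (x - mean_image).
```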

  • 8/3/2019 Nearest Neighbour and Clustering

    106/122

    PCA (continued)

    Based on the Eigen values, the first 64 PCs were found to be significant.

    Variance captured ~ 92.74%.

    Any image can be represented by its PCs: Y = [y1 y2 ..... y64].

    Reduced data matrix with 64 variables: Y is a 1000 x 64 matrix.

  • 8/3/2019 Nearest Neighbour and Clustering

    107/122

    [Figure: cumulative percentage of variance explained vs. number of principal components used; 64 components explain 92.74% of the variance.]

  • 8/3/2019 Nearest Neighbour and Clustering

    108/122

    Interpreting the PCs as Image Features

    The Eigen vectors are a rotation of the original axes to more meaningful directions. The PCs are the projection of the data onto each of these new axes.

    Image reconstruction: the original image can be reconstructed by projecting the PCs back onto the old axes. Using the most significant PCs will give a reconstructed image that is close to the original image.

    These features can be used for carrying out further investigations, e.g. classification!


  • 8/3/2019 Nearest Neighbour and Clustering

    109/122

    Image Reconstruction

    Mean-centered image: I = (X - Xmean)

    PCs as features: yi = ei . I, so Y = [y1, y2, ..., y64] = E I, where E = [e1 e2 ... e64]

    Reconstruction: Xrecon = E * Y + Xmean

    [Figures: actual image from the test set; image completely reconstructed using all 256 principal components; reconstructions using 150 and 64 principal components.]
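
    A small sketch of the projection and reconstruction step, under assumptions consistent with the slide (here E is taken to hold the first 64 eigendigits as columns; NumPy and the placeholder names are mine):

```python
import numpy as np

def project_and_reconstruct(x, mean_image, E):
    """x: flattened image (256,); E: (256, 64) matrix whose columns are eigendigits."""
    centered = x - mean_image          # I = (X - Xmean)
    Y = E.T @ centered                 # PCs as features, one per eigendigit
    x_recon = E @ Y + mean_image       # project the PCs back onto the old axes
    return Y, x_recon
```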

  • 8/3/2019 Nearest Neighbour and Clustering

    110/122

    Normality test on PCs

    [Figure: QQ plots of sample quantiles versus standard normal quantiles for principal components 1, 3, 5, 10, 20, 30, 40, 50 and 60.]

  • 8/3/2019 Nearest Neighbour and Clustering

    111/122

    Classification

    Principal components are used as features of the images.

    LDA assumes multivariate normality of the feature groups and common covariance.

    The Fisher discriminant procedure assumes only common covariance.


  • 8/3/2019 Nearest Neighbour and Clustering

    112/122

    Classification (contd.)

    Equal cost of misclassification is assumed.

    Misclassification error rates: APER based on the training data, AER on the validation data.

    Error rates using different numbers of PCs were compared, averaged over several random samplings of training and validation data from the full data set.


  • 8/3/2019 Nearest Neighbour and Clustering

    113/122

    Performing LDA

    Prior probabilities of each class were taken as the frequency of that class in the data.

    Equivalence of the covariance matrices is a strong assumption; error rates were used to check the validity of this assumption.

    Spooled was used for the covariance matrix.

  • 8/3/2019 Nearest Neighbour and Clustering

    114/122

  • 8/3/2019 Nearest Neighbour and Clustering

    115/122


  • 8/3/2019 Nearest Neighbour and Clustering

    116/122

    Fisher Discriminant Results (r = 2 discriminants)

    APER:
    No of PCs    256    150    64
    APER %        32   34.5  37.4

    AER:
    No of PCs    256    150    64
    AER %         45     42    40

    Both AER and APER are very high.


  • 8/3/2019 Nearest Neighbour and Clustering

    117/122

    Fisher Discriminant Results (r = 7 discriminants)

    APER:
    No of PCs    256    150    64
    APER %       3.2    4.8   7.9

    AER:
    No of PCs    256    150    64
    AER %       14.1   12.4  10.8

    Considerable improvement in AER and APER; performance is close to LDA; using 64 PCs is better.


  • 8/3/2019 Nearest Neighbour and Clustering

    118/122

    Fisher Discriminant Results (r = 9, all discriminants)

    APER:
    No of PCs    256    150    64
    APER %       1.6    4.3   6.4

    AER:
    No of PCs     256     150     64
    AER %       13.21   10.55   9.86

    No significant performance gain over r = 7; error rates are ~ LDA (as expected!).


  • 8/3/2019 Nearest Neighbour and Clustering

    119/122

    Nearest Neighbour Classifier

    No assumption is made about the distribution of the data; the Euclidean distance is used to find the nearest neighbour.

    The classifier finds the nearest neighbour from the training set to the test image and assigns its label to the test image.

    [Figure: test point assigned to Class 2.]


  • 8/3/2019 Nearest Neighbour and Clustering

    120/122

    K-Nearest Neighbour Classifier (KNN)

    Compute the k nearest neighbours and assign the class by majority vote.

    [Figure: k = 3; Class 1 gets 2 votes, Class 2 gets 1 vote, so the test point is assigned to Class 1.]


  • 8/3/2019 Nearest Neighbour and Clustering

    121/122

    1-NN Classification Results:

    No of PCs    256    150    64
    AER %       7.09   7.01  6.45

    Test error rates have improved compared to LDA and Fisher.

    Using 64 PCs gives better results.

    Using higher values of k does not show improvement in recognition rate.


  • 8/3/2019 Nearest Neighbour and Clustering

    122/122

    Misclassification in NN (rows = actual digit, columns = recognised as):

            0     1    2    3    4    5    6    7    8    9
    0    1376     0    4    2    0    5   12    2    0    0
    1       0  1113    1    0    1    0    2    0    2    0
    2      22     9  728   17    4    4    6   16   18    2
    3       4     0    4  690    2   26    0    4    6    3
    4       3    15    9    0  687    0    7    2    4   32
    5       9     3   12   37    5  517   32    0   23    9
    6      10     3    5    0    3    2  714    0    3    2
    7       0     6    1    0   19    0    0  657    1   20
    8       8    11    1   26    7    7    8    5  547   13
    9       6     1    2    0   23    0    0   32    0  664

    Euclidean distances between transformed images of the same class can be very high.