Nearest Neighbour and Clustering


TRANSCRIPT

  • 8/3/2019 Nearest Neighbour and Clustering

    1/122

    Nearest Neighbor and Clustering

  • 8/3/2019 Nearest Neighbour and Clustering

    2/122

  • 8/3/2019 Nearest Neighbour and Clustering

    3/122

    Nearest Neighbour and Clustering

    Nearest Neighbor: used for prediction as well as consolidation.
    Clustering: used mostly for consolidating data into a high-level view and a general grouping of records into like behaviors.

    Nearest Neighbor: the space is defined by the problem to be solved (supervised learning).
    Clustering: the space is defined as a default n-dimensional space, or is defined by the user, or is a predefined space driven by past experience (unsupervised learning).

    Nearest Neighbor: generally only uses distance metrics to determine nearness.
    Clustering: can use other metrics besides distance to determine the nearness of two records - for example, linking two points together.

  • 8/3/2019 Nearest Neighbour and Clustering

    4/122

    K Nearest Neighbors

    Advantages:
    Nonparametric architecture
    Simple
    Powerful
    Requires no training time

    Disadvantages:
    Memory intensive
    Classification/estimation is slow

  • 8/3/2019 Nearest Neighbour and Clustering

    5/122

    K Nearest Neighbors

    The key issues involved in training this model include setting:

    the variable K - via validation techniques (e.g. cross-validation)

    the type of distance metric - e.g. the Euclidean measure:

    $\mathrm{Dist}(X, Y) = \sqrt{\sum_{i=1}^{D} (X_i - Y_i)^2}$

  • 8/3/2019 Nearest Neighbour and Clustering

    6/122

    Figure: K Nearest Neighbors example
    Stored training-set patterns, with X the input pattern for classification;
    dashed lines (---) show the Euclidean distance measure to the nearest three patterns.

  • 8/3/2019 Nearest Neighbour and Clustering

    7/122

    Store all input data in the training set

    For each pattern in the test set

    Search for the K nearest patterns to the

    input pattern using a Euclidean distance

    measure

    For classification, compute the confidence for

    each class as Ci/K,

    (where Ci is the number of patterns among

    the K nearest patterns belonging to class i.)

    The classification for the input pattern is the

    class with the highest confidence.
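
    A minimal Python sketch of the procedure above (NumPy is assumed; the names train_X, train_y and the helper knn_classify are illustrative placeholders, not from the slides):

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x, k):
    """Classify one input pattern x against a stored training set.

    train_X: (n, d) array of stored patterns, train_y: length-n labels.
    Returns the predicted class and its confidence Ci/K.
    """
    # Euclidean distance from x to every stored training pattern
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]            # indices of the K nearest patterns
    votes = Counter(train_y[i] for i in nearest)
    best_class, ci = votes.most_common(1)[0]   # class with the most votes
    return best_class, ci / k                  # confidence = Ci / K
```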

  • 8/3/2019 Nearest Neighbour and Clustering

    8/122

    Training parameters and typical settings: Number of nearest neighbors

    The number of nearest neighbors (K) should be based on cross-validation over a number of K settings.

    K = 1 is a good baseline model to benchmark against.

    A good rule of thumb is that K should be less than the square root of the total number of training patterns.
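
    One hedged way to pick K along these lines, using scikit-learn's cross-validation (scikit-learn and the data names X_train, y_train are assumptions, not something the slides prescribe):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X, y, max_k):
    """Return the K with the best 5-fold cross-validated accuracy, trying 1..max_k."""
    scores = {}
    for k in range(1, max_k + 1):
        clf = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(clf, X, y, cv=5).mean()
    return max(scores, key=scores.get)

# Rule of thumb from the slide: only try K up to sqrt(number of training patterns),
# and keep K = 1 as the baseline to benchmark against.
# best_k = choose_k(X_train, y_train, max_k=int(np.sqrt(len(X_train))))
```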

  • 8/3/2019 Nearest Neighbour and Clustering

    9/122

    Training parameters and typical settings: Input compression

    Since KNN is very storage intensive, we may want to compress data patterns as a preprocessing step before classification.

    Using input compression will usually result in slightly worse performance.

    Sometimes using compression will improve performance, because it performs automatic normalization of the data, which can equalize the effect of each input in the Euclidean distance measure.

  • 8/3/2019 Nearest Neighbour and Clustering

    10/122

    Nearest Neighbour and Clustering

    Among the oldest techniques used in data mining (DM).

    Like records are grouped or clustered together and put into the same grouping.

    The nearest neighbor prediction technique is quite close to clustering: to find the prediction value for one record, look for similar records with similar predictor values in the historical DB.

  • 8/3/2019 Nearest Neighbour and Clustering

    11/122

    Use the prediction value of the record which is nearest to the unknown record.

    Example: a laundry uses clustering.

    In business, clusters are more dynamic: which cluster a record falls into may change daily or monthly, and is therefore difficult to decide.

    Another NN example: the income group of one's neighbours.

  • 8/3/2019 Nearest Neighbour and Clustering

    12/122

    The best way to predict an unknown person's income is possibly to choose the closest persons.

    The nearest neighbour prediction algorithm works on a DB in very much the same way.

    Many factors go into the "nearest" condition: the person's location, school attended, degree attained, etc.

  • 8/3/2019 Nearest Neighbour and Clustering

    13/122

    Business Score Card

    Measures critical to business success deal with ease of deployment and real-world problems: avoiding serious mistakes as well as achieving big successes.

    A DM technique needs to be easy to use and deployable in as automated a fashion as possible.

    It should provide clear, understandable answers, and answers that can be converted into ROI.

  • 8/3/2019 Nearest Neighbour and Clustering

    14/122

    BSC

    Automation: NN methods are relatively automated, although some preprocessing is performed in converting predictors into values that can be used in a measure of distance.

    Unordered categorical predictors (e.g. eye color) need to be defined in terms of their distance from each other when there is a match (e.g. whether blue is close to brown).

  • 8/3/2019 Nearest Neighbour and Clustering

    15/122

    Clarity: excellent for clear explanation of why a prediction was made. A single example or a set of examples can be extracted from the historical DB as evidence for why a prediction should or should not be made.

    The system can also communicate when it is not confident of its prediction.

  • 8/3/2019 Nearest Neighbour and Clustering

    16/122

    ROI: Since the individual records of the

    nearest neighbor are returned directly

    without altering the DB, it is possible to

    understand all facets of business behavior

    and thus derive a more complete estimate

    of the ROI not just from the prediction but

    from a variety of different factors

  • 8/3/2019 Nearest Neighbour and Clustering

    17/122

    Where to use clustering and nearest neighbor prediction

    Applications range from personal bankruptcy to computer recognition of human handwriting.

    Clustering for clarity: in clustering, like records are grouped together, giving a high-level view of what is going on in the DB.

    Clustering as segmentation gives a bird's-eye view of the business. Commercial offerings: PRIZM & Microvision.

  • 8/3/2019 Nearest Neighbour and Clustering

    18/122

    These offerings grouped the population by demographic information into segments.

    The clustering information is then used by the end user to tag the customers in the DB.

    The business user gets a high-level view of what is happening within each cluster.

    Once they have worked with these clusters, users will know more about customer reactions.

  • 8/3/2019 Nearest Neighbour and Clustering

    19/122

    Clustering for outlier analysis

    Clustering is done to an extent where some records stick out.

    Example: profit across stores or departments.

  • 8/3/2019 Nearest Neighbour and Clustering

    20/122

    Nearest Neighbour for Prediction

    One particular object can be closer to another object than to a third object.

    People have an innate sense of ordering on a variety of objects:

    an apple is closer to an orange than to a tomato;

    a Toyota Corolla is closer to a Honda Civic than to a Porsche.

    This sense of ordering places objects in time and space and makes sense in the real world.

  • 8/3/2019 Nearest Neighbour and Clustering

    21/122

    This definition of nearness, which seems to be ubiquitous, also allows us to make predictions.

    The NN prediction algorithm is simply stated as: objects that are near to each other will have similar prediction values as well.

    Thus if you know the prediction value of one of the objects, you can predict it for its nearest neighbors.

  • 8/3/2019 Nearest Neighbour and Clustering

    22/122

    Classic NN example: text retrieval

    Define a document, then look for more such documents.

    NN looks for important characteristics shared with those documents which have been marked as interesting.

    It can be used in a wide variety of places.

    Successful use depends on preformatting of the data, so that nearness can be calculated and individual records can be defined.

  • 8/3/2019 Nearest Neighbour and Clustering

    23/122

    This is easy for text retrieval, but not for time-series data such as stock prices, where there is no inherent ordering of records.

  • 8/3/2019 Nearest Neighbour and Clustering

    24/122

    Application Score Card

    Rules are seldom used for prediction here

    Used for unsupervised learning

    Clusters: the underlying prediction method for nearest neighbor technology is nearness in some feature space. This is the same underlying metric used for most clustering algorithms, although for nearest neighbor the feature space is shaped in such a way as to facilitate a particular prediction.

  • 8/3/2019 Nearest Neighbour and Clustering

    25/122

    Links: NN techniques can be used for link analysis as long as the data is preformatted so that the predictor values to be linked fall within the same record.

    Outliers: NN techniques are particularly good at detecting outliers, since they have effectively created a space within which it is possible to determine when a record is out of place.

  • 8/3/2019 Nearest Neighbour and Clustering

    26/122

    Rules: one strength of NN techniques is that they take into account all the predictors to some degree, which is helpful for prediction but makes for a complex model that cannot easily be described as a rule. The systems are also generally optimized for prediction of new records rather than exhaustive extraction of interesting rules from the DB.

  • 8/3/2019 Nearest Neighbour and Clustering

    27/122

    Sequences: NN techniques have been successfully used to make predictions in time sequences. The time values need to be encoded in the records.

    Text: most text retrieval systems are based around NN technology, and most of the remaining breakthroughs come from further refinements of the predictor-weighting algorithms and the distance calculations.

  • 8/3/2019 Nearest Neighbour and Clustering

    28/122

    General Idea

    NN is a refinement of clustering in the sense that both use distance in some feature space to create either structure in the data or predictions.

    NN is a way of automatically determining the weighting of the importance of the predictors and how the distance will be measured within the feature space.

  • 8/3/2019 Nearest Neighbour and Clustering

    29/122

    Clustering is one special case: the importance of each predictor is considered to be equivalent.

    Example: a set of people and clustering them into groups of friends.

  • 8/3/2019 Nearest Neighbour and Clustering

    30/122

    There is no best way to cluster

    Is clustering on financial status better than on eye color or on food habits?

    If clustering is done with no specific purpose, just to group data, probably all of these are OK.

    The reasons for clustering are ill defined, as clusters are used more often for exploration and summarization than for prediction.

  • 8/3/2019 Nearest Neighbour and Clustering

    31/122

    How are tradeoffs made when determining which records fall into which clusters?

    Example: aged vs. young, classical vs. rock.

    When clustering a large number of records, these tradeoffs are explicitly defined by the clustering algorithm.

  • 8/3/2019 Nearest Neighbour and Clustering

    32/122

    Difference between clustering and NN

    Main distinction: clustering is an unsupervised learning technique, while NN for prediction is a supervised learning technique.

    Unsupervised: there is no particular

    reason for the creation of the models

    Supervised: prediction

    Prediction: patterns presented are most

    important

  • 8/3/2019 Nearest Neighbour and Clustering

    33/122

  • 8/3/2019 Nearest Neighbour and Clustering

    34/122

    How is the space for clustering and

    nearest neighbor defined?

    Clustering: an n-dimensional space is formed by assigning one predictor to each dimension.

    NN: predictors are also mapped to dimensions, but those dimensions are literally stretched or compressed according to how important the particular predictor is in making the prediction.

    Stretching a dimension makes it more important than the others.

  • 8/3/2019 Nearest Neighbour and Clustering

    35/122

  • 8/3/2019 Nearest Neighbour and Clustering

    36/122

    The distance between a cluster and a given data point is measured from the centre of mass of the cluster.

    The centre of mass of the cluster can be calculated as the average of the predictor values.

    Clusters are defined either solely by their centre, or by their centre with some radius attached, in which case all points that fall within the radius are classified into that cluster.

  • 8/3/2019 Nearest Neighbour and Clustering

    37/122

    The centre record is a prototypical record.

    Normal DB records are mapped onto an n-dimensional space.

    2 or 3 dimensions are easy to visualize; more dimensions become complex.

  • 8/3/2019 Nearest Neighbour and Clustering

    38/122

    How is nearness defined?

    Clustering and NN both work with an n-dimensional space, with one record being close to or far from another record.

    Nearness could be determined as follows: any record in the historical DB that is exactly the same as the record to be predicted is considered close, and anything else is far away.

  • 8/3/2019 Nearest Neighbour and Clustering

    39/122

    Difficulty with this strategy:

    exact matches of records are unlikely in the DB, and a perfectly matching record may be spurious.

    Better results are obtained by taking a vote among several nearby records.

  • 8/3/2019 Nearest Neighbour and Clustering

    40/122

    Two other distances:

    Manhattan distance: adds up the differences between each predictor of the historical record and the record to be predicted.

    Euclidean distance (Pythagorean): the distance between two points in n dimensions, obtained by squaring the differences of the predictor values for the two records and taking the square root of the sum.
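
    The two distances written out as a small sketch (pure Python; representing each record as an equal-length list of numeric predictor values is an assumption):

```python
import math

def manhattan(rec_a, rec_b):
    # Sum of the absolute differences between each predictor
    return sum(abs(a - b) for a, b in zip(rec_a, rec_b))

def euclidean(rec_a, rec_b):
    # Square the differences, sum them, take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rec_a, rec_b)))
```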

  • 8/3/2019 Nearest Neighbour and Clustering

    41/122

    Distance between records xyz & abc:

    Age: 6
    Salary: 3100
    Color of eye: 0
    Gender: 1
    Income: 1 (high = 3, med = 2, low = 1)

    Total difference = 3108

  • 8/3/2019 Nearest Neighbour and Clustering

    42/122

    The difference is dominated by salary; whether the other predictors match or not does not matter.

    To balance this, use normalized values (0 to 100).

    The maximum salary difference in the data set is 16543; between xyz and abc it is 3100, which is 19% of the maximum, so the total becomes 6 + 19 + 0 + 100 + 100 = 225.


  • 8/3/2019 Nearest Neighbour and Clustering

    43/122

    Weighting the dimensions: distance with a purpose

    When a high-income record (e.g. Mukesh Ambani) is added, an outlier is created when clustering is done on age and income.

    Normalizing does not help in this case.

    When "near" is defined, how important is each dimension's contribution?

    Answer: it depends on what is to be accomplished.

  • 8/3/2019 Nearest Neighbour and Clustering

    44/122

    Calculating dimension weights

    There are several automatic ways of calculating the importance of different dimensions.

    Example: in document classification and prediction, the dimensions of the space are often the individual words contained in the document.

    Example: whether "entrepreneur" occurs or not.

    "The" occurs several times, so it is of little significance; the earlier word is significant.

  • 8/3/2019 Nearest Neighbour and Clustering

    45/122

    Weights:

    1. Inverse frequency is often used: "the" occurred in 10000 docs, so its word weight = 1/10000 = 0.0001; "entrepreneur" occurred in 100 docs, so 1/100 = 0.01.

    2. Importance of the word for the topic to be predicted: if the topic is starting a small business, words such as "entrepreneur" and "venture capital" will be given higher weights.
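
    A small illustration of the inverse-frequency weighting described above (the document counts mirror the slide; the function itself is just a sketch):

```python
def inverse_frequency_weight(doc_count):
    # A word that appears in many documents gets a small weight
    return 1.0 / doc_count

print(inverse_frequency_weight(10000))  # 'the'          -> 0.0001
print(inverse_frequency_weight(100))    # 'entrepreneur' -> 0.01
```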

  • 8/3/2019 Nearest Neighbour and Clustering

    46/122

    Data mining on documents is a special situation: many dimensions, and all dimensions are binary.

    Other business problems have binary (gender), categorical (eye color), and numeric (revenue) dimensions.

    Each dimension is weighted depending on its relevance to the topic to be predicted.

    Calculation: the correlation between the predictor and the prediction value.

  • 8/3/2019 Nearest Neighbour and Clustering

    47/122

    Or the conditional probability that the prediction has a certain value given that the predictor has a certain value.

    Dimension weights can also be calculated via algorithmic search: random weights are tried initially, then slowly modified to improve the accuracy of the system.

  • 8/3/2019 Nearest Neighbour and Clustering

    48/122

    Hierarchical and nonhierarchical

    clustering

    Hierarchical clustering (HC) builds clusters from small to big; it is unsupervised learning.

    Fewer or greater numbers of clusters may be desired; choose the number of clusters depending on the application.

    Extreme case: as many clusters as there are records. In this case records are optimally similar to each other within a cluster (there is only one per cluster) but different from other clusters.

  • 8/3/2019 Nearest Neighbour and Clustering

    49/122

    Such clustering probably cannot find useful patterns: there is no summary information, and the data is not understood any better.

    Fewer clusters than the original number of records is better.

    Advantage of HC: it allows end users to choose from either many clusters or only a few.

  • 8/3/2019 Nearest Neighbour and Clustering

    50/122

    HC can be viewed as a tree: smaller clusters merge together to create the next highest level of clusters, which again merge at that level, and so on.

    The user can decide what number of clusters adequately summarizes the data while providing useful information.

    A single cluster gives great summarization but does not provide any specific information.

  • 8/3/2019 Nearest Neighbour and Clustering

    51/122

    Two algorithmic approaches to HC:

    1. Agglomerative: AC techniques start with as many clusters as there are records, each cluster containing one record. The clusters that are nearest to each other are merged, and this is continued until we have a single cluster containing all records at the top of the hierarchy.

  • 8/3/2019 Nearest Neighbour and Clustering

    52/122

    2. Divisive: DC techniques take the opposite approach.

    Start with all records in one cluster, split it into smaller pieces, and then try to split those further.

  • 8/3/2019 Nearest Neighbour and Clustering

    53/122

    Non-hierarchical clustering is faster to create from the historical DB.

    The user makes a decision about the number of clusters desired or the nearness required, and the algorithm may be run multiple times.

    Either start with an arbitrary clustering and iteratively improve it by shuffling records, or create clusters by taking one record at a time depending on the criteria.

  • 8/3/2019 Nearest Neighbour and Clustering

    54/122

    Nonhierarchical clustering

    Two kinds of NHC:

    1. Single-pass methods: the DB is passed through only once in order to create the clusters.

    2. Reallocation methods: records are moved or reallocated from one cluster to another to create better clusters, with multiple passes through the DB. Still faster compared to HC.

  • 8/3/2019 Nearest Neighbour and Clustering

    55/122

    Algorithm for the single-pass technique:

    Read in a record from the DB and determine the cluster it best fits (by the measure of nearness).

    If even the nearest cluster is still far away, start a new cluster with this record.

    Read the next record.
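
    A minimal single-pass sketch along these lines (Euclidean nearness to a cluster centre and a fixed threshold are assumptions, since the slide leaves the nearness measure open):

```python
import numpy as np

def single_pass_cluster(records, threshold):
    """One pass over the records; each cluster is kept as [centre, members]."""
    clusters = []
    for rec in records:
        rec = np.asarray(rec, dtype=float)
        if clusters:
            dists = [np.linalg.norm(rec - c[0]) for c in clusters]
            best = int(np.argmin(dists))
            if dists[best] <= threshold:          # fits an existing cluster
                clusters[best][1].append(rec)
                clusters[best][0] = np.mean(clusters[best][1], axis=0)
                continue
        clusters.append([rec, [rec]])             # nearest is still far away: new cluster
    return clusters
```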

  • 8/3/2019 Nearest Neighbour and Clustering

    56/122

    Reading records is expensive, so single pass scores better.

    Problem: large clusters form because decisions are made early, and the sequence in which records are processed matters.

    Reallocation solves this problem by readjusting the clusters, optimizing similarity.

  • 8/3/2019 Nearest Neighbour and Clustering

    57/122

    Algorithm for reallocation:

    1. Preset the number of clusters desired.

    2. Randomly pick a record to become the centre or seed for each of these clusters.

    3. Go through the DB and assign each record to the nearest cluster.

    4. Recalculate the centres of the clusters.

    Repeat steps 3 & 4 until there is a minimum of reallocation between clusters.
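
    The reallocation procedure above is essentially what k-means does; a compact sketch, assuming Euclidean distance and NumPy arrays:

```python
import numpy as np

def reallocation_cluster(records, n_clusters, max_iter=100, seed=0):
    """Steps 1-4 of the slide, iterated until the assignments settle."""
    X = np.asarray(records, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick records to act as the seed centre of each cluster
    centres = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each record to the nearest cluster centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate the centres (keep the old centre if a cluster is empty)
        new_centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centres[k] for k in range(n_clusters)])
        if np.allclose(new_centres, centres):   # minimal reallocation: stop
            break
        centres = new_centres
    return labels, centres
```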

  • 8/3/2019 Nearest Neighbour and Clustering

    58/122

    The records initially assigned may not be good fits.

    By recalculating the centres, clusters that actually match better are formed.

    The centre moves towards high density and away from outliers.

    Predefining the number of clusters may be a worse idea than letting it be driven by the data.

  • 8/3/2019 Nearest Neighbour and Clustering

    59/122

    There is no one right answer as to how clustering is to be done.

  • 8/3/2019 Nearest Neighbour and Clustering

    60/122

    HC

    HC has an advantage over NHC: clusters are defined solely by the data (no predetermined number).

    The number of clusters can be increased or decreased by moving down or up the hierarchy.

    The hierarchy can be built either from the top, dividing further, or from the bottom, merging records at every level.

  • 8/3/2019 Nearest Neighbour and Clustering

    61/122

    Merging or splitting is usually done two clusters at a time.

    Agglomerative algorithm:

    Start with as many clusters as there are records, with one record in each cluster.

    Combine the two nearest clusters into a larger cluster.

    Continue until only one cluster remains.
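
    A naive sketch of the agglomerative loop just listed (one record per starting cluster, merging the two nearest clusters by centroid distance; the centroid choice is an assumption, since the later slides list several linkage options):

```python
import numpy as np

def agglomerate(records, n_clusters_wanted=1):
    # Start with as many clusters as there are records, one record in each
    clusters = [[np.asarray(r, dtype=float)] for r in records]
    while len(clusters) > n_clusters_wanted:
        centroids = [np.mean(c, axis=0) for c in clusters]
        # Find the two nearest clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(centroids[i] - centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Combine them into a larger cluster and continue
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```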

  • 8/3/2019 Nearest Neighbour and Clustering

    62/122

    Divisive technique algorithm:

    Start with one cluster that contains all the records in the DB.

    Determine the division of the existing cluster that best maximizes similarity within clusters and dissimilarity between clusters.

    Divide the cluster and repeat on the two smaller clusters.

    Stop when some minimum threshold of cluster size or total number has been reached, or when there is only one record per cluster.

  • 8/3/2019 Nearest Neighbour and Clustering

    63/122

    Divisive techniques are quite expensive to compute: they separate a cluster into every possible smaller cluster and pick the best one (minimum average distance).

    Agglomerative techniques are therefore preferred; there, decisions are made about which clusters to merge.

  • 8/3/2019 Nearest Neighbour and Clustering

    64/122

    Join the clusters whose resulting merged cluster has the minimum total distance between all records: Ward's method. It produces a symmetric hierarchy and is good at recovering cluster structure, but it is sensitive to outliers and has difficulty recovering elongated structures.

  • 8/3/2019 Nearest Neighbour and Clustering

    65/122

    These merge decisions can be made in several ways:

    Join the clusters whose nearest records are as near as possible: the single-link method. Since clusters can be joined on a single nearest pair of records, this technique can create long, snake-like clusters and is not good at extracting classical spherical, compact clusters.

  • 8/3/2019 Nearest Neighbour and Clustering

    66/122

    Join the clusters whose most distant records are as near as possible: the complete-link method. All records are linked within some maximum distance; it favours compact clusters.

    Join the clusters where the average distance between all pairs of records is as small as possible: the group-average-link method. It includes both nearest and most distant records, so the resulting clusters range from elongated single-link to tight complete-link clusters.
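
    These linkage choices map directly onto SciPy's hierarchical clustering, which can serve as a quick way to try them (SciPy and the toy data are assumptions; 'single', 'complete', 'average' and 'ward' correspond to single-link, complete-link, group-average-link and Ward's method):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.rand(50, 4)                            # toy records with 4 predictors
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                    # build the merge hierarchy
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```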

  • 8/3/2019 Nearest Neighbour and Clustering

    67/122

    Implementation of the KNN method for object recognition.

  • 8/3/2019 Nearest Neighbour and Clustering

    68/122

    Outline

    Introduction.

    Description of the problem.

    Description of the method.

    Image library.

    Process of identification.

    Example.

    Future work.

  • 8/3/2019 Nearest Neighbour and Clustering

    69/122

    Introduction

    Generally speaking, the problem of object recognition is how to teach a computer to recognize different objects in a picture. This is a nontrivial problem. Some of the main difficulties in solving it are separating an object from the background (especially in the presence of clutter or occlusions in the background) and the ability to recognize an object under different lighting.

  • 8/3/2019 Nearest Neighbour and Clustering

    70/122

    Introduction.

    In this research I am trying to improve the accuracy of object recognition by implementing the KNN method with a new weighted Hamming-Levenshtein distance that I developed.

  • 8/3/2019 Nearest Neighbour and Clustering

    71/122

    Description of the problem. The problem of object recognition can

    be divided into two parts:

    1) Location of an object on the picture;

    2) Identification of an object.

    For example, assume that we have the

    following picture:

  • 8/3/2019 Nearest Neighbour and Clustering

    72/122

    Description of the problem.

  • 8/3/2019 Nearest Neighbour and Clustering

    73/122

    Description of the problem. and we have the following library of

    images that we will use for object

    identification:

  • 8/3/2019 Nearest Neighbour and Clustering

    74/122

    Description of the problem. Our goal is to identify and locate objects from our library in the picture.

  • 8/3/2019 Nearest Neighbour and Clustering

    75/122

    Description of the problem. In this research I have developed a method of object identification assuming that we already know the location of an object; I am going to develop the location method in my future work.

  • 8/3/2019 Nearest Neighbour and Clustering

    76/122

    Description of the method. We will use the KNN method to identify objects.

    For example, assume we need to identify an object X on a given picture. Let us consider the space of pictures generated by the image of X and the images from our library.

  • 8/3/2019 Nearest Neighbour and Clustering

    77/122

    Description of the method. In this space we pick, say, the 5 images closest to X, and identify X by finding the plurality class of the nearest pictures.

    [Figure: points X, A1-A3, B1-B3, C1, C2 in the picture space; nearest neighbors: A1, B1, A2, B2, A3.]

  • 8/3/2019 Nearest Neighbour and Clustering

    78/122

    Description of the method. In order to use the KNN method we need to introduce a measure of similarity between two pictures.

    First of all, in order to say something about the similarity between pictures, we need to get some idea of the shape of the objects in these pictures. To do this we use an edge-detection method (the Sobel method, for example).

  • 8/3/2019 Nearest Neighbour and Clustering

    79/122

  • 8/3/2019 Nearest Neighbour and Clustering

    80/122

  • 8/3/2019 Nearest Neighbour and Clustering

    81/122

    Description of the method. Next, we turn the edge-detected picture into a bit array by thresholding intensities to 0 or 1. In fact, we are going to keep the images in the library in this form.

  • 8/3/2019 Nearest Neighbour and Clustering

    82/122

  • 8/3/2019 Nearest Neighbour and Clustering

    83/122

    Description of the method. Now, in order to compare two pictures, we need to compare two 2-dimensional bit arrays.

    It may seem natural to use the traditional Hamming distance for bitstrings, which is defined as follows: given two bitstrings of the same dimension, the Hamming distance is the minimum number of symbol changes needed to change one bitmap into the other.

  • 8/3/2019 Nearest Neighbour and Clustering

    84/122

    Description of the method. For example, the Hamming distance between

    (A) 10001001 and
    (B) 11100000 is 4.

    Notice that the Hamming distance between

    (A) 10001001 and
    (C) 10010010 is also 4, but intuitively one can regard (C) as a better match for (A) than (B).
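
    Reproducing the comparison in code (a plain Hamming distance over equal-length bitstrings; the string representation is just for illustration):

```python
def hamming(a, b):
    # Number of positions at which two equal-length bitstrings differ
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming("10001001", "11100000"))  # (A) vs (B) -> 4
print(hamming("10001001", "10010010"))  # (A) vs (C) -> 4 as well
```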

  • 8/3/2019 Nearest Neighbour and Clustering

    85/122

    Description of the method. We can modify the Hamming distance using the idea of the Levenshtein distance, which is usually used for comparing text strings and is obtained by finding the cheapest way to transform one string into another. The transformations are the one-step operations of insertion, deletion and substitution, and each transformation has a certain cost.
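
    A standard dynamic-programming sketch of the Levenshtein distance with unit costs (the description allows arbitrary per-operation costs; unit costs are an assumption for brevity):

```python
def levenshtein(a, b):
    # dp[i][j] = cheapest way to turn a[:i] into b[:j] using
    # insertion, deletion and substitution, each at cost 1
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(a)][len(b)]

print(levenshtein("10001001", "10010010"))
```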

  • 8/3/2019 Nearest Neighbour and Clustering

    86/122

    Description of the method. Also, since different parts of images have different levels of importance in the process of recognition, we can assign a weight value to each pixel of an image and use it in the definition of a distance. For example, we can eliminate the background of a picture by assigning zero weight to the corresponding pixels.

  • 8/3/2019 Nearest Neighbour and Clustering

    87/122

  • 8/3/2019 Nearest Neighbour and Clustering

    88/122

  • 8/3/2019 Nearest Neighbour and Clustering

    89/122

    Description of the method. To get the weighted Hamming-Levenshtein distance between two pictures, we divide each bitstring into several substrings of the same length, compare corresponding substrings using the Levenshtein distance, and sum all these distances multiplied by the average weight of each substring.
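
    The slides do not give the exact formulation of the author's weighted Hamming-Levenshtein distance, so the following is only one plausible reading of the description above (fixed-length substrings, per-substring Levenshtein distance, average pixel weight as multiplier). It assumes the levenshtein sketch above is in scope, and the chunk length of 16 is arbitrary:

```python
def weighted_hl_distance(bits_a, bits_b, weights, chunk=16):
    """Split the bitstrings into substrings of the same length, compare
    corresponding substrings with Levenshtein distance, and sum those
    distances weighted by the average weight of each substring."""
    assert len(bits_a) == len(bits_b) == len(weights)
    total = 0.0
    for start in range(0, len(bits_a), chunk):
        sub_a = bits_a[start:start + chunk]
        sub_b = bits_b[start:start + chunk]
        w = sum(weights[start:start + chunk]) / len(sub_a)  # average weight
        total += w * levenshtein(sub_a, sub_b)
    return total
```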

  • 8/3/2019 Nearest Neighbour and Clustering

    90/122

    Image library. Each object in the library is represented by several images taken under different lighting and from different sides. Each image in the library is represented by two 2-dimensional arrays: the first array contains the edge-detected picture turned into a bit array, and the second one contains the weight values assigned to each pixel.

  • 8/3/2019 Nearest Neighbour and Clustering

    91/122

    Process of identification. To identify an object, we turn its edge-detected image into a bit array by thresholding intensities to 0 or 1.

    Then we measure the distance between this image and each image from our library, using the corresponding weight arrays and the weighted Hamming-Levenshtein distance.

    Using the KNN method we identify the object.

  • 8/3/2019 Nearest Neighbour and Clustering

    92/122

    Example. Below, some object-identification results obtained using the method described above are presented.

  • 8/3/2019 Nearest Neighbour and Clustering

    93/122

    Example. Assume that we have the image library

    with the following edge-detected images

    of objects and weighted images.

  • 8/3/2019 Nearest Neighbour and Clustering

    94/122

  • 8/3/2019 Nearest Neighbour and Clustering

    95/122

  • 8/3/2019 Nearest Neighbour and Clustering

    96/122

  • 8/3/2019 Nearest Neighbour and Clustering

    97/122

    Example. Let us try to identify the following picture.

    Picture 1

  • 8/3/2019 Nearest Neighbour and Clustering

    98/122

    Example. We compare this picture with each

    image in our library, and we get the

    following table of distances.

  • 8/3/2019 Nearest Neighbour and Clustering

    99/122

    Example. If we select the three closest neighbors of Picture 1, then we can identify it as a Bear.

    Distances to Picture 1:

    Bear 1    876
    Bear 2  21009
    Bear 3  24495
    Cat 1   27401
    Cat 2   25986
    Cat 3   24538
    Dog 1   21629
    Dog 2   26809
    Dog 3   25546

  • 8/3/2019 Nearest Neighbour and Clustering

    100/122

    Example. Let us do similar calculations for these

    two pictures:

    Picture 2. Picture 3.

  • 8/3/2019 Nearest Neighbour and Clustering

    101/122

              Picture 2   Picture 3
    Bear 1      31678       32629
    Bear 2      24644       23790
    Bear 3      31662       32150
    Cat 1        1864       28687
    Cat 2       22798       25655
    Cat 3       22242       25824
    Dog 1       23087        1577
    Dog 2       25679       24042
    Dog 3       25785       23880

  • 8/3/2019 Nearest Neighbour and Clustering

    102/122

    Future work.

    Develop a method for locating an object in the picture.

    Develop an idea of a reasonable weight distribution for the images in the library.

    Improve the identification algorithm to allow comparison of pictures of different sizes.

    Continue to work on improving the definition of the weighted Hamming-Levenshtein distance.


  • 8/3/2019 Nearest Neighbour and Clustering

    103/122

    Introduction: Optical Character Recognition (OCR)

    Predict the label of each image using the classification function learned from training.

    OCR is basically a classification task on multivariate data:

    Pixel values -> variables
    Each type of character -> class

    Objective: to recognise images of handwritten digits based on classification methods for multivariate data.

  • 8/3/2019 Nearest Neighbour and Clustering

    104/122

    Handwritten Digit data

    16 x 16 (= 256 pixel) grey-scale images of digits in the range 0-9:
    Xi = [xi1, xi2, ..., xi256], with xij in [0, 1] and yi in {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

    9298 labelled samples; training set ~ 1000 images; test set randomly selected from the full database.

    Basic idea: correctly identify the digit given an image.

    [Figure: a 16 x 16 grey-scale digit image.]

  • 8/3/2019 Nearest Neighbour and Clustering

    105/122

    Dimension reduction - PCA

    PCA is done on the mean-centered images.

    The eigenvectors of the 256 x 256 matrix are called the Eigen digits (256-dimensional).

    The larger an Eigen value, the more important is that Eigen digit.

    The i-th PC of an image X is yi = ei . X

    [Figures: the average image / average digit, and a grid of Eigen digits.]
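
    A hedged NumPy sketch of the eigendigit computation just described (the data matrix X of shape (n_images, 256) is a placeholder; eigendecomposition of the sample covariance is one standard way to obtain the principal components, though the slides do not state which routine was used):

```python
import numpy as np

def eigendigits(X):
    """X: (n_images, 256) array of flattened 16 x 16 digit images."""
    mean_image = X.mean(axis=0)
    centered = X - mean_image                 # PCA on the mean-centered images
    cov = np.cov(centered, rowvar=False)      # 256 x 256 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvectors = Eigen digits
    order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
    return mean_image, eigvals[order], eigvecs[:, order]

# The i-th principal component of an image x is then y_i = e_i . (x - mean_image).
```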

  • 8/3/2019 Nearest Neighbour and Clustering

    106/122

    PCA (continued)

    Based on the Eigen values, the first 64 PCs were found to be significant.

    Variance captured ~ 92.74%.

    Any image can be represented by its PCs: Y = [y1 y2 ..... y64].

    Reduced data matrix with 64 variables: Y is a 1000 x 64 matrix.

  • 8/3/2019 Nearest Neighbour and Clustering

    107/122

    [Figure: cumulative percentage of variance explained vs. number of principal components used; 64 components explain 92.74% of the variance.]

  • 8/3/2019 Nearest Neighbour and Clustering

    108/122

    Interpreting the PCs as Image Features

    The Eigen vectors are a rotation of the original axes to more meaningful directions. The PCs are the projection of the data onto each of these new axes.

    Image reconstruction: the original image can be reconstructed by projecting the PCs back onto the old axes. Using the most significant PCs will give a reconstructed image that is close to the original image.

    These features can be used for carrying out further investigations, e.g. classification!


  • 8/3/2019 Nearest Neighbour and Clustering

    109/122

    Image Reconstruction

    Mean-centered image: I = (X - Xmean)

    PCs as features: yi = ei . I, so Y = [y1, y2, ..., y64] = E I, where E = [e1 e2 ... e64]

    Reconstruction: Xrecon = E * Y + Xmean

    [Figures: actual image from the test set; image completely reconstructed using all 256 principal components; reconstructions using 150 and 64 principal components.]
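
    A small sketch of the projection and reconstruction step, under assumptions consistent with the slide (here E is taken to hold the first 64 eigendigits as columns; NumPy and the placeholder names are mine):

```python
import numpy as np

def project_and_reconstruct(x, mean_image, E):
    """x: flattened image (256,); E: (256, 64) matrix whose columns are eigendigits."""
    centered = x - mean_image          # I = (X - Xmean)
    Y = E.T @ centered                 # PCs as features, one per eigendigit
    x_recon = E @ Y + mean_image       # project the PCs back onto the old axes
    return Y, x_recon
```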

  • 8/3/2019 Nearest Neighbour and Clustering

    110/122

    Normality test on PCs

    [Figure: QQ plots of sample quantiles versus standard normal quantiles for principal components 1, 3, 5, 10, 20, 30, 40, 50 and 60.]

  • 8/3/2019 Nearest Neighbour and Clustering

    111/122

    Classification

    Principal components are used as features of the images.

    LDA assumes multivariate normality of the feature groups and common covariance.

    The Fisher discriminant procedure assumes only common covariance.


  • 8/3/2019 Nearest Neighbour and Clustering

    112/122

    Classification (contd.)

    Equal cost of misclassification is assumed.

    Misclassification error rates: APER based on the training data, AER on the validation data.

    Error rates using different numbers of PCs were compared, averaged over several random samplings of training and validation data from the full data set.


  • 8/3/2019 Nearest Neighbour and Clustering

    113/122

    Performing LDA

    Prior probabilities of each class were taken as the frequency of that class in the data.

    Equivalence of the covariance matrices is a strong assumption; error rates were used to check the validity of this assumption.

    Spooled was used for the covariance matrix.

  • 8/3/2019 Nearest Neighbour and Clustering

    114/122

  • 8/3/2019 Nearest Neighbour and Clustering

    115/122


  • 8/3/2019 Nearest Neighbour and Clustering

    116/122

    Fisher Discriminant Results (r = 2 discriminants)

    APER:
    No of PCs    256    150    64
    APER %        32   34.5  37.4

    AER:
    No of PCs    256    150    64
    AER %         45     42    40

    Both AER and APER are very high.


  • 8/3/2019 Nearest Neighbour and Clustering

    117/122

    Fisher Discriminant Results (r = 7 discriminants)

    APER:
    No of PCs    256    150    64
    APER %       3.2    4.8   7.9

    AER:
    No of PCs    256    150    64
    AER %       14.1   12.4  10.8

    Considerable improvement in AER and APER; performance is close to LDA; using 64 PCs is better.


  • 8/3/2019 Nearest Neighbour and Clustering

    118/122

    Fisher Discriminant Results (r = 9, all discriminants)

    APER:
    No of PCs    256    150    64
    APER %       1.6    4.3   6.4

    AER:
    No of PCs     256     150     64
    AER %       13.21   10.55   9.86

    No significant performance gain over r = 7; error rates are ~ LDA (as expected!).


  • 8/3/2019 Nearest Neighbour and Clustering

    119/122

    Nearest Neighbour Classifier

    No assumption is made about the distribution of the data; the Euclidean distance is used to find the nearest neighbour.

    The classifier finds the nearest neighbour from the training set to the test image and assigns its label to the test image.

    [Figure: test point assigned to Class 2.]


  • 8/3/2019 Nearest Neighbour and Clustering

    120/122

    K-Nearest Neighbour Classifier (KNN)

    Compute the k nearest neighbours and assign the class by majority vote.

    [Figure: k = 3; Class 1 gets 2 votes, Class 2 gets 1 vote, so the test point is assigned to Class 1.]


  • 8/3/2019 Nearest Neighbour and Clustering

    121/122

    1-NN Classification Results:

    No of PCs    256    150    64
    AER %       7.09   7.01  6.45

    Test error rates have improved compared to LDA and Fisher.

    Using 64 PCs gives better results.

    Using higher values of k does not show improvement in recognition rate.


  • 8/3/2019 Nearest Neighbour and Clustering

    122/122

    Misclassification in NN (rows = actual digit, columns = recognised as):

            0     1    2    3    4    5    6    7    8    9
    0    1376     0    4    2    0    5   12    2    0    0
    1       0  1113    1    0    1    0    2    0    2    0
    2      22     9  728   17    4    4    6   16   18    2
    3       4     0    4  690    2   26    0    4    6    3
    4       3    15    9    0  687    0    7    2    4   32
    5       9     3   12   37    5  517   32    0   23    9
    6      10     3    5    0    3    2  714    0    3    2
    7       0     6    1    0   19    0    0  657    1   20
    8       8    11    1   26    7    7    8    5  547   13
    9       6     1    2    0   23    0    0   32    0  664

    Euclidean distances between transformed images of the same class can be very high.