14vcat

Upload: basit-jasani

Post on 07-Aug-2018


TRANSCRIPT

  • 8/20/2019 14vcat

    1/67

    Introduction to Information Retrieval

    Hinrich Schütze and Christina Lioma

    Lecture 14: Vector Space Classification

  • 8/20/2019 14vcat

    2/67

    Overview

    1. Recap
    2. Feature selection
    3. Intro vector space classification
    4. Rocchio
    5. kNN
    6. Linear classifiers
    7. > two classes

  • 8/20/2019 14vcat

    3/67

    Outline

    1. Recap
    2. Feature selection
    3. Intro vector space classification
    4. Rocchio
    5. kNN
    6. Linear classifiers
    7. > two classes

  • 8/20/2019 14vcat

    4/67

    Relevance feedback: Basic idea

    The user issues a (short, simple) query.

    The search engine returns a set of documents.

    User marks some docs as relevant, some as nonrelevant.

    Search engine computes a new representation of the information need, which should be better than the initial query.

    Search engine runs new query and returns new results.

    New results have (hopefully) better recall.

  • 8/20/2019 14vcat

    5/67

    Rocchio illustrated

    (figure not preserved in this transcript)

  • 8/20/2019 14vcat

    6/67

    Take-away today

    Feature selection for text classification: How to select a subset of available dimensions

    Vector space classification: Basic idea of doing text classification for documents that are represented as vectors

    Rocchio classifier: Rocchio relevance feedback idea applied to text classification

    k nearest neighbor classification

    Linear classifiers

    More than two classes

  • 8/20/2019 14vcat

    7/67

    Outline

    1. Recap
    2. Feature selection
    3. Intro vector space classification
    4. Rocchio
    5. kNN
    6. Linear classifiers
    7. > two classes

  • 8/20/2019 14vcat

    8/67

    Feature selection

    In text classification, we usually represent documents in a high-dimensional space, with each dimension corresponding to a term.

    In this lecture: axis = dimension = word = term = feature

    Many dimensions correspond to rare words.

    Rare words can mislead the classifier.

    Rare misleading features are called noise features.

    Eliminating noise features from the representation increases efficiency and effectiveness of text classification.

  • 8/20/2019 14vcat

    9/67

    Example for a noise feature

    Let's say we're doing text classification for the class China.

    Suppose a rare term, say ARACHNOCENTRIC, has no information about China . . .

    . . . but all instances of ARACHNOCENTRIC happen to occur in China documents in our training set.

    Then we may learn a classifier that incorrectly interprets ARACHNOCENTRIC as evidence for the class China.

    Such an incorrect generalization from an accidental property of the training set is called overfitting.

  • 8/20/2019 14vcat

    10/67

    Basic feature selection algorithm

    (algorithm listing not preserved in this transcript)

  • 8/20/2019 14vcat

    11/67

    Different feature selection methods

    A feature selection method is mainly defined by the feature utility measure it employs.

    Feature utility measures:

      Frequency: select the most frequent terms

      Mutual information: select the terms with the highest mutual information

      Mutual information is also called information gain in this context.

      Chi-square (see book)
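
    The selection step described above can be sketched in Python as "score every term with a utility measure, keep the top k". This is a minimal sketch, not the lecture's own listing: the function names, the representation of documents as sets of terms, the document-frequency utility, and the toy collection are all illustrative assumptions.

```python
def select_features(docs, labels, target_class, k, utility):
    """Return the k terms with the highest utility for target_class."""
    vocabulary = sorted({term for doc in docs for term in doc})
    scored = [(utility(docs, labels, term, target_class), term)
              for term in vocabulary]
    # Highest utility first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]


def frequency_utility(docs, labels, term, target_class):
    """Assumed frequency utility: number of class documents containing the term."""
    return sum(1 for doc, label in zip(docs, labels)
               if label == target_class and term in doc)


# Hypothetical toy collection (each document is a set of terms).
docs = [
    {"beijing", "china", "trade"},
    {"china", "shanghai"},
    {"china", "trade"},
    {"tokyo", "japan", "trade"},
]
labels = ["China", "China", "China", "Japan"]

top_terms = select_features(docs, labels, "China", 2, frequency_utility)
print(top_terms)
```

    Any other utility measure (mutual information, chi-square) can be passed in place of `frequency_utility` as long as it has the same signature.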

  • 8/20/2019 14vcat

    12/67

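    Mutual information

    Compute the feature utility A(t, c) as the expected mutual information (MI) of term t and class c.

    MI tells us "how much information" the term contains about the class and vice versa.

    For example, if a term's occurrence is independent of the class (same proportion of docs within/without class contain the term), then MI is 0.

    Definition:

    $$I(U;C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U=e_t, C=e_c) \log_2 \frac{P(U=e_t, C=e_c)}{P(U=e_t)\,P(C=e_c)}$$

    where U = e_t indicates whether the document contains t and C = e_c indicates whether the document is in c.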

  • 8/20/2019 14vcat

    13/67

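    How to compute MI values

    Based on maximum likelihood estimates, the formula we actually use is:

    $$I(U;C) = \frac{N_{11}}{N}\log_2\frac{N N_{11}}{N_{1.}N_{.1}} + \frac{N_{01}}{N}\log_2\frac{N N_{01}}{N_{0.}N_{.1}} + \frac{N_{10}}{N}\log_2\frac{N N_{10}}{N_{1.}N_{.0}} + \frac{N_{00}}{N}\log_2\frac{N N_{00}}{N_{0.}N_{.0}}$$

    N10: number of documents that contain t (e_t = 1) and are not in c (e_c = 0)

    N11: number of documents that contain t (e_t = 1) and are in c (e_c = 1)

    N01: number of documents that do not contain t (e_t = 0) and are in c (e_c = 1)

    N00: number of documents that do not contain t (e_t = 0) and are not in c (e_c = 0)

    N = N00 + N01 + N10 + N11; a dot in a subscript sums over that index, e.g. N1. = N10 + N11.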
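
    As a rough illustration, the maximum likelihood computation of MI from the four document counts can be written in a few lines of Python. The function name and the example counts are assumptions made for this sketch; cells with a zero count are treated as contributing 0, the usual convention.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Expected mutual information of a term and a class from a 2x2 count table."""
    n = n11 + n10 + n01 + n00
    n1_ = n11 + n10  # documents that contain the term
    n0_ = n01 + n00  # documents that do not contain the term
    n_1 = n11 + n01  # documents in the class
    n_0 = n10 + n00  # documents not in the class

    def cell(n_tc, n_t, n_c):
        # A cell with zero count contributes nothing (0 * log 0 is taken as 0).
        if n_tc == 0:
            return 0.0
        return (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))

    return (cell(n11, n1_, n_1) + cell(n01, n0_, n_1)
            + cell(n10, n1_, n_0) + cell(n00, n0_, n_0))


# Hypothetical counts: term and class independent, then perfectly dependent.
mi_independent = mutual_information(n11=10, n10=20, n01=30, n00=60)
mi_dependent = mutual_information(n11=50, n10=0, n01=0, n00=50)
print(mi_independent, mi_dependent)
```

    The first table has the same proportion of term-bearing documents inside and outside the class, so its MI is 0; the second is the opposite extreme, where the term determines the class.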

  • 8/20/2019 14vcat

    14/67

    MI example for poultry/EXPORT in Reuters

    (worked example not preserved in this transcript)

  • 8/20/2019 14vcat

    15/67

    MI feature selection on Reuters

    (figure not preserved in this transcript)

  • 8/20/2019 14vcat

    16/67

    Naive Bayes: Effect of feature selection

    (figure not preserved in this transcript)

    (multinomial = multinomial Naive Bayes, binomial = Bernoulli Naive Bayes)

  • 8/20/2019 14vcat

    17/67

    Feature selection for Naive Bayes

    In general, feature selection is necessary for Naive Bayes to get decent performance.

    Also true for most other learning methods in text classification: you need feature selection for optimal performance.

  • 8/20/2019 14vcat

    18/67

    Exercise

    (i) Compute the analogue of the "export"/POULTRY contingency table for "Kyoto"/JAPAN in the collection given below.

    (ii) Make up a contingency table for which MI is 0, that is, term and class are independent of each other.

    "export"/POULTRY table: (table and collection not preserved in this transcript)

  • 8/20/2019 14vcat

    19/67

    Outline

    1. Recap
    2. Feature selection
    3. Intro vector space classification
    4. Rocchio
    5. kNN
    6. Linear classifiers
    7. > two classes

  • 8/20/2019 14vcat

    20/67

    Recall vector space representation

    Each document is a vector, one component for each term.

    Terms are axes.

    High dimensionality: 100,000s of dimensions

    Normalize vectors (documents) to unit length

    How can we do classification in this space?

  • 8/20/2019 14vcat

    21/67

    Vector space classification

    As before, the training set is a set of documents, each labeled with its class.

    In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.

    Premise 1: Documents in the same class form a contiguous region.

    Premise 2: Documents from different classes don't overlap.

    We define lines, surfaces, hypersurfaces to divide regions.

  • 8/20/2019 14vcat

    22/67

    Classes in the vector space

    (figure not preserved in this transcript)

    Should the document ⋆ be assigned to China, UK or Kenya?

    Find separators between the classes.

    Based on these separators: ⋆ should be assigned to China.

    How do we find separators that . . .

  • 8/20/2019 14vcat

    23/67

    Aside: 2D/3D graphs can be misleading

    (figure not preserved in this transcript)

    Left: A projection of the 2D semicircle to 1D. For the points x1, x2, x3, x4, x5 at x coordinates −0.9, −0.2, 0, 0.2, 0.9, the distance |x2x3| ≈ 0.201 only differs by 0.5% from |x'2x'3| = 0.2; but |x1x3|/|x'1x'3| = d_true/d_projected ≈ 1.06/0.9 ≈ 1.18 is an example of a large distortion (18%) when projecting a large area.

    Right: The . . .
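
    The distances quoted in the caption can be checked with a few lines of Python. This is only a sketch under the assumption that the five points lie on the upper unit semicircle and that the projection simply drops the y coordinate; the variable names are illustrative.

```python
import math

# x coordinates from the caption; y is chosen so each point lies on the unit semicircle.
xs = [-0.9, -0.2, 0.0, 0.2, 0.9]
points = [(x, math.sqrt(1.0 - x * x)) for x in xs]


def true_distance(i, j):
    """Euclidean distance between points i and j in 2D."""
    return math.dist(points[i], points[j])


def projected_distance(i, j):
    """Distance after projecting both points onto the x axis."""
    return abs(xs[i] - xs[j])


d23_true, d23_proj = true_distance(1, 2), projected_distance(1, 2)
d13_true, d13_proj = true_distance(0, 2), projected_distance(0, 2)

print(round(d23_true, 3), round(d23_true / d23_proj, 3))  # small area: ratio close to 1
print(round(d13_true, 2), round(d13_true / d13_proj, 2))  # large area: ratio well above 1
```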

  • 8/20/2019 14vcat

    24/67

    Outline

    1. Recap
    2. Feature selection
    3. Intro vector space classification
    4. Rocchio
    5. kNN
    6. Linear classifiers
    7. > two classes

  • 8/20/2019 14vcat

    25/67

    Relevance feedback

    In relevance feedback, the user marks documents as relevant/nonrelevant.

    Relevant/nonrelevant can be viewed as classes or categories.

    For each document, the user decides which of these two classes is correct.

    The IR system then uses these class assignments to build a better query ("model") of the information need . . .

    . . . and returns better documents.

    Relevance feedback is a form of text classification.

  • 8/20/2019 14vcat

    26/67

    Using Rocchio for vector space classification

    The principal difference between relevance feedback and text classification:

      The training set is given as part of the input in text classification.

      It is interactively created in relevance feedback.

  • 8/20/2019 14vcat

    27/67

    Rocchio classification: Basic idea

    Compute a centroid for each class.

      The centroid is the average of all documents in the class.

    Assign each test document to the class of its closest centroid.

     

  • 8/20/2019 14vcat

    28/67

    Recall definition of centroid

    $$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$

    where $D_c$ is the set of all documents that belong to class c and $\vec{v}(d)$ is the vector space representation of d.

     

  • 8/20/2019 14vcat

    29/67

    Rocchio algorithm

    (algorithm listing not preserved in this transcript)
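
    A minimal Python sketch of Rocchio classification as described on the preceding slides: compute one centroid per class, then assign a test document to the class of the closest centroid. The function names, the use of Euclidean distance, and the two-dimensional toy vectors are assumptions of this sketch, not the lecture's listing.

```python
import math


def centroid(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def train_rocchio(vectors, labels):
    """Return a dict mapping each class to the centroid of its training vectors."""
    by_class = {}
    for vector, label in zip(vectors, labels):
        by_class.setdefault(label, []).append(vector)
    return {label: centroid(members) for label, members in by_class.items()}


def apply_rocchio(centroids, d):
    """Assign d to the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], d))


# Hypothetical training set in a 2-term vector space.
train_vectors = [(1.0, 0.0), (0.8, 0.2), (0.0, 1.0), (0.2, 0.8)]
train_labels = ["China", "China", "UK", "UK"]

centroids = train_rocchio(train_vectors, train_labels)
print(apply_rocchio(centroids, (0.7, 0.3)))
print(apply_rocchio(centroids, (0.2, 0.9)))
```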

     

  • 8/20/2019 14vcat

    30/67

    Rocchio illustrated: a1 = a2, b1 = b2, c1 = c2

    (figure not preserved in this transcript)

     

  • 8/20/2019 14vcat

    31/67

    Rocchio properties

    Rocchio forms a simple representation for each class: the centroid

    We can interpret the centroid as the prototype of the class.

    Classification is based on similarity to / distance from centroid/prototype.

    Does not guarantee that classifications are consistent with the training data!

     

  • 8/20/2019 14vcat

    32/67

    Time complexity of Rocchio

    (table not preserved in this transcript)

     

  • 8/20/2019 14vcat

    33/67

    Rocchio vs. Naive Bayes

    In many cases, Rocchio performs worse than Naive Bayes.

    One reason: Rocchio does not handle nonconvex, multimodal classes correctly.

     

  • 8/20/2019 14vcat

    34/67

    Rocchio cannot handle nonconvex, multimodal classes

    (scatter plot of a and b points not preserved in this transcript)

    Exercise: Why is Rocchio not expected to do well for the classification task a vs. b here?

    A is centroid of the a's, B is centroid of the b's.

    The point o is closer to A than to B.

    But o is a better fit for the b class.

    A is a multimodal class with two prototypes.

     

  • 8/20/2019 14vcat

    35/67

    Outline

    1. Recap
    2. Feature selection
    3. Intro vector space classification
    4. Rocchio
    5. kNN
    6. Linear classifiers
    7. > two classes

     

  • 8/20/2019 14vcat

    36/67

    kNN classification

    kNN classification is another vector space classification method.

    It also is very simple and easy to implement.

    kNN is more accurate (in most cases) than Naive Bayes and Rocchio.

    If you need to get a pretty accurate classifier up and running in a short time . . .

    . . . and you don't care about efficiency that much . . .

    . . . use kNN.

     

  • 8/20/2019 14vcat

    37/67

    kNN classification

    kNN = k nearest neighbors

    kNN classification rule for k = 1 (1NN): Assign each test document to the class of its nearest neighbor in the training set.

    1NN is not very robust: one document can be mislabeled or atypical.

    kNN classification rule for k > 1 (kNN): Assign each test document to the majority class of its k nearest neighbors in the training set.

    Rationale of kNN: contiguity hypothesis

      We expect a test document d to have the same label as the training documents located in the local region surrounding d.

     

  • 8/20/2019 14vcat

    38/67

    Probabilistic kNN

    Probabilistic version of kNN: P(c|d) = fraction of k neighbors of d that are in c

    kNN classification rule for probabilistic kNN: Assign d to class c with highest P(c|d)

     

  • 8/20/2019 14vcat

    39/67

    Probabilistic kNN

    (figure not preserved in this transcript)

    1NN, 3NN classification decision for star?

     

  • 8/20/2019 14vcat

    40/67

    kNN algorithm

    (algorithm listing not preserved in this transcript)
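
    The kNN rules from the preceding slides (majority vote, and P(c|d) as the fraction of the k neighbors in c) can be sketched in Python as follows. Function names, Euclidean distance, alphabetical tie-breaking, and the toy data are assumptions of this sketch; the toy data is built so that a single atypical "a" document decides the 1NN result but is outvoted for k = 3.

```python
import math
from collections import Counter


def knn_probabilities(train_vectors, train_labels, d, k):
    """Estimate P(c|d) as the fraction of the k nearest training documents in c."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: math.dist(pair[0], d))
    counts = Counter(label for _, label in ranked[:k])
    return {label: count / k for label, count in counts.items()}


def knn_classify(train_vectors, train_labels, d, k):
    """Assign d to the class with the highest P(c|d); ties go to the alphabetically first class."""
    probabilities = knn_probabilities(train_vectors, train_labels, d, k)
    return max(sorted(probabilities), key=probabilities.get)


# Hypothetical training set: one atypical "a" document sits inside the "b" region.
train_vectors = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0),
                 (1.5, 1.5), (1.6, 1.4), (1.4, 1.6)]
train_labels = ["a", "a", "a", "b", "b", "b"]
star = (1.1, 1.1)

print(knn_classify(train_vectors, train_labels, star, k=1))
print(knn_classify(train_vectors, train_labels, star, k=3))
print(knn_probabilities(train_vectors, train_labels, star, k=3))
```

    Note that every call scans the whole training set, which is the test-time cost discussed on the time complexity slide.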

     

  • 8/20/2019 14vcat

    41/67

    Exercise

    (figure not preserved in this transcript)

    How is star classified by:

    (i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio?

     

  • 8/20/2019 14vcat

    42/67

    Time complexity of kNN

    kNN with preprocessing of training set: training and testing complexities (table not preserved in this transcript)

    kNN test time is proportional to the size of the training set!

    The larger the training set, the longer it takes to classify a test document.

    kNN is inefficient for very large training sets.

     

  • 8/20/2019 14vcat

    43/67

    kNN: Discussion

    No training necessary

      But linear preprocessing of documents is as expensive as training Naive Bayes.

      We always preprocess the training set, so in reality training time of kNN is linear.

    kNN is very accurate if training set is large.

    Optimality result: asymptotically zero error if Bayes rate is zero.

    But kNN can be very inaccurate if training set is small.

     

  • 8/20/2019 14vcat

    44/67

    Outline

    1. Recap
    2. Feature selection
    3. Intro vector space classification
    4. Rocchio
    5. kNN
    6. Linear classifiers
    7. > two classes

  • 8/20/2019 14vcat

    45/67

    Linear classifiers

    Definition: A linear classifier computes a linear combination or weighted sum $\sum_i w_i d_i$ of the feature values.

    Classification decision: $\sum_i w_i d_i \geq \theta$?

    . . . where $\theta$ (the threshold) is a parameter.

    (First, we only consider binary classifiers.)

    Geometrically, this corresponds to a line (2D), a plane (3D) or a hyperplane (higher dimensionalities), the separator.

    We find this separator based on training set.

    Methods for finding separator: Perceptron, Rocchio, Naive Bayes, as we will explain on the next slides
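
    The decision rule of a binary linear classifier fits in one function. The sketch below assumes the convention that a score of at least θ means "in class c"; the weights, threshold, and test points are hypothetical values chosen only to show points on either side of the separator.

```python
def linear_score(weights, d):
    """Weighted sum of the feature values of document vector d."""
    return sum(w_i * d_i for w_i, d_i in zip(weights, d))


def in_class(weights, d, theta):
    """True if d falls on the class-c side of the separator (score >= theta)."""
    return linear_score(weights, d) >= theta


# Hypothetical 2D classifier: the separator is the line 1.0*d1 + 2.0*d2 = 2.0.
weights = [1.0, 2.0]
theta = 2.0

print(in_class(weights, [1.0, 1.0], theta))  # score 3.0, class c
print(in_class(weights, [0.5, 0.5], theta))  # score 1.5, complement class
```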

  • 8/20/2019 14vcat

    46/67

    A linear classifier in 1D

    A linear classifier in 1D is a point described by the equation $w_1 d_1 = \theta$

    The point at $\theta / w_1$

    Points ($d_1$) with $w_1 d_1 \geq \theta$ are in the class c.

    Points ($d_1$) with $w_1 d_1 < \theta$ are in the complement class $\bar{c}$.

  • 8/20/2019 14vcat

    47/67

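    A linear classifier in 2D

    A linear classifier in 2D is a line described by the equation $w_1 d_1 + w_2 d_2 = \theta$

    Example for a 2D linear classifier (figure not preserved in this transcript)

    Points ($d_1$ $d_2$) with $w_1 d_1 + w_2 d_2 \geq \theta$ are in the class c.

    Points ($d_1$ $d_2$) with $w_1 d_1 + w_2 d_2 < \theta$ are in the complement class $\bar{c}$.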

  • 8/20/2019 14vcat

    48/67

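    A linear classifier in 3D

    A linear classifier in 3D is a plane described by the equation $w_1 d_1 + w_2 d_2 + w_3 d_3 = \theta$

    Example for a 3D linear classifier (figure not preserved in this transcript)

    Points ($d_1$ $d_2$ $d_3$) with $w_1 d_1 + w_2 d_2 + w_3 d_3 \geq \theta$ are in the class c.

    Points ($d_1$ $d_2$ $d_3$) with $w_1 d_1 + w_2 d_2 + w_3 d_3 < \theta$ are in the complement class $\bar{c}$.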

  • 8/20/2019 14vcat

    49/67

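    Rocchio as a linear classifier

    Rocchio is a linear classifier defined by:

    $$\sum_{i=1}^{M} w_i d_i = \vec{w} \cdot \vec{d} = \theta$$

    where $\vec{w}$ is the normal vector $\vec{\mu}(c_1) - \vec{\mu}(c_2)$ and $\theta = 0.5 \cdot (|\vec{\mu}(c_1)|^2 - |\vec{\mu}(c_2)|^2)$.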
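
    To see why Rocchio is linear, one can derive the normal vector and threshold from two centroids and check that the resulting linear rule agrees with the closest-centroid rule. The Python sketch below does this for two hypothetical centroids and a handful of hypothetical test points; all names and values are illustrative assumptions.

```python
import math


def rocchio_hyperplane(mu1, mu2):
    """Normal vector w = mu1 - mu2 and threshold 0.5 * (|mu1|^2 - |mu2|^2)."""
    w = [a - b for a, b in zip(mu1, mu2)]
    theta = 0.5 * (sum(a * a for a in mu1) - sum(b * b for b in mu2))
    return w, theta


def linear_decision(w, theta, d):
    """True if d is on the c1 side of the hyperplane."""
    return sum(w_i * d_i for w_i, d_i in zip(w, d)) >= theta


def closest_centroid_decision(mu1, mu2, d):
    """True if d is at least as close to mu1 as to mu2."""
    return math.dist(d, mu1) <= math.dist(d, mu2)


# Hypothetical centroids of classes c1 and c2.
mu1, mu2 = (3.0, 1.0), (1.0, 1.0)
w, theta = rocchio_hyperplane(mu1, mu2)
print(w, theta)

test_points = [(2.5, 0.0), (1.5, 5.0), (4.0, -2.0), (0.0, 0.0)]
agreement = [linear_decision(w, theta, d) == closest_centroid_decision(mu1, mu2, d)
             for d in test_points]
print(agreement)
```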

  • 8/20/2019 14vcat

    50/67

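    Naive Bayes as a linear classifier

    Multinomial Naive Bayes is a linear classifier (in log space) defined by:

    $$\sum_{i=1}^{M} w_i d_i = \theta$$

    where $w_i = \log[\hat{P}(t_i|c) / \hat{P}(t_i|\bar{c})]$, $d_i$ = number of occurrences of $t_i$ in d, and $\theta = -\log[\hat{P}(c) / \hat{P}(\bar{c})]$.

    Here, the index i, $1 \leq i \leq M$, refers to terms of the vocabulary (not to positions in d as k did in our original definition of Naive Bayes).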
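
    The log-space view can be checked numerically: the linear score minus the threshold should equal the Naive Bayes log odds of c versus its complement. The sketch below assumes a two-class setting with already-estimated probabilities; the three-term vocabulary, the probability values, and the function names are hypothetical.

```python
import math


def nb_linear_parameters(p_term_c, p_term_not_c, p_c):
    """Weights and threshold of the linear form of two-class multinomial Naive Bayes."""
    w = [math.log(p / q) for p, q in zip(p_term_c, p_term_not_c)]
    theta = -math.log(p_c / (1.0 - p_c))
    return w, theta


def nb_log_odds(p_term_c, p_term_not_c, p_c, d):
    """log P(c|d) - log P(not c|d), computed directly from the Naive Bayes model."""
    log_c = math.log(p_c) + sum(n * math.log(p) for n, p in zip(d, p_term_c))
    log_not_c = math.log(1.0 - p_c) + sum(n * math.log(q) for n, q in zip(d, p_term_not_c))
    return log_c - log_not_c


# Hypothetical estimates for a 3-term vocabulary.
p_term_c = [0.5, 0.3, 0.2]       # P(t_i | c)
p_term_not_c = [0.2, 0.3, 0.5]   # P(t_i | complement of c)
p_c = 0.4                        # prior P(c)
d = [2, 1, 0]                    # term counts d_i of the test document

w, theta = nb_linear_parameters(p_term_c, p_term_not_c, p_c)
score = sum(w_i * d_i for w_i, d_i in zip(w, d))
log_odds = nb_log_odds(p_term_c, p_term_not_c, p_c, d)
print(score >= theta, log_odds >= 0)
```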

  • 8/20/2019 14vcat

    51/67

    kNN is not a linear classifier

    (figure not preserved in this transcript)

    Classification decision based on majority of k nearest neighbors.

    The decision boundaries between classes are piecewise linear . . .

    . . . but they are in general not linear classifiers that can be described as $\sum_{i=1}^{M} w_i d_i = \theta$.

  • 8/20/2019 14vcat

    52/67

    Example of a linear two-class classifier

    (weight table not preserved in this transcript)

    This is for the class interest in Reuters-21578.

    For simplicity: assume a simple 0/1 vector representation

    d1: "rate discount dlrs world"

    d2: "prime dlrs"

    θ = 0

    Exercise: Which class is d1 assigned to? Which class is d2 assigned to?

    We assign document "rate discount dlrs world" to interest since . . . (calculation not preserved in this transcript)

  • 8/20/2019 14vcat

    53/67

    Which hyperplane?

    (figure not preserved in this transcript)

  • 8/20/2019 14vcat

    54/67

    Learning algorithms for vector space classification

    In terms of actual computation, there are two types of learning algorithms.

    (i) Simple learning algorithms that estimate the parameters of the classifier directly from the training data, often in one linear pass.

      Naive Bayes, Rocchio, kNN are all examples of this.

    (ii) Iterative algorithms

      Support vector machines

      Perceptron (example available as PDF on website: http://ifnlp.org/ir/pdf/p.pdf)

    The best performing learning algorithms usually require iterative learning.

  • 8/20/2019 14vcat

    55/67

    Which hyperplane?

    (figure not preserved in this transcript)

  • 8/20/2019 14vcat

    56/67

    Which hyperplane?

    For linearly separable training sets: there are infinitely many separating hyperplanes.

    They all separate the training set perfectly . . .

    . . . but they behave differently on test data.

    Error rates on new data are low for some, high for others.

    How do we find a low-error separator?

    Perceptron: generally bad; Naive Bayes, Rocchio: ok; linear SVM: good

  • 8/20/2019 14vcat

    57/67

    Linear classifiers: Discussion

    Many common text classifiers are linear classifiers: Naive Bayes, Rocchio, logistic regression, linear support vector machines etc.

    Each method has a different way of selecting the separating hyperplane

      Huge differences in performance on test documents

    Can we get better performance with more powerful nonlinear classifiers?

    Not in general: a given amount of training data may suffice for estimating a linear boundary, but not for estimating a more complex nonlinear boundary.

  • 8/20/2019 14vcat

    58/67

    A nonlinear problem

    (figure not preserved in this transcript)

    Linear classifier like Rocchio does badly on this task.

    kNN will do well (assuming enough training data)

  • 8/20/2019 14vcat

    59/67

    Which classifier do I use for a given TC problem?

    Is there a learning method that is optimal for all text classification problems?

    No, because there is a tradeoff between bias and variance.

    Factors to take into account:

      How much training data is available?

      How simple/complex is the problem? (linear vs. nonlinear decision boundary)

      How noisy is the problem?

      How stable is the problem over time?

        For an unstable problem, it's better to use a simple and robust classifier.

  • 8/20/2019 14vcat

    60/67

    Outline

    1. Recap
    2. Feature selection
    3. Intro vector space classification
    4. Rocchio
    5. kNN
    6. Linear classifiers
    7. > two classes

  • 8/20/2019 14vcat

    61/67

  • 8/20/2019 14vcat

    62/67

    . . . exclusive.

    Each document belongs to exactly one class.

    Example: language of a document (assumption: no document contains multiple languages)

    One-of classification with linear . . .

  • 8/20/2019 14vcat

    63/67

  • 8/20/2019 14vcat

    64/67

    Example: topic classification

    Usually: make decisions on the region, on the subject area, on the industry and so on "independently"

    Any-of classification with linear . . .
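
    The difference between the two settings can be sketched with a set of binary linear classifiers, one per class. In the any-of case each classifier decides independently, so a document may receive several labels or none; in the one-of case exactly one class is chosen. Since the corresponding slides are not preserved here, the one-of rule below (pick the class with the largest margin, score minus threshold) is an assumption of this sketch, as are all names, weights, and documents.

```python
def score(weights, d):
    """Weighted sum of the feature values of document vector d."""
    return sum(w_i * d_i for w_i, d_i in zip(weights, d))


def any_of(classifiers, d):
    """Apply every binary classifier independently; return all classes that accept d."""
    return sorted(label for label, (weights, theta) in classifiers.items()
                  if score(weights, d) >= theta)


def one_of(classifiers, d):
    """Return the single class with the largest margin (assumed rule: score - theta)."""
    return max(sorted(classifiers),
               key=lambda label: score(classifiers[label][0], d) - classifiers[label][1])


# Hypothetical any-of setting: region, subject area and industry are decided independently.
topic_classifiers = {
    "region:China": ([1.0, 0.0, 0.5], 0.8),
    "topic:trade": ([0.0, 1.0, 0.5], 0.8),
    "industry:poultry": ([0.0, 0.0, 1.0], 0.8),
}
print(any_of(topic_classifiers, [1.0, 1.0, 0.0]))

# Hypothetical one-of setting: each document has exactly one language.
language_classifiers = {
    "en": ([1.0, 0.0], 0.0),
    "de": ([0.0, 1.0], 0.0),
}
print(one_of(language_classifiers, [0.2, 0.9]))
```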

  • 8/20/2019 14vcat

    65/67

  • 8/20/2019 14vcat

    66/67

    . . . text classification

    k nearest neighbor classification

    Linear classifiers

    More than two classes

  • 8/20/2019 14vcat

    67/67

    Resources

    Chapter 13 of IIR (feature selection)

    Chapter 14 of IIR

    Resources at http://ifnlp.org/ir

      Perceptron example

      General overview of text classification: Sebastiani (2002)

      Text classification chapter on decision trees and perceptrons: Manning & Schütze (1999)

      One of the best machine learning textbooks: Hastie, Tibshirani & Friedman (2003)