14vcat
-
8/20/2019 14vcat
Introduction to Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 14: Vector Space Classification
-
Overview
1 Recap
2 Feature selection
3 Intro vector space classification
4 Rocchio
5 kNN
6 Linear classifiers
7 > two classes
-
Relevance feedback: Basic idea
The user issues a (short, simple) query.
The search engine returns a set of documents.
User marks some docs as relevant, some as nonrelevant.
Search engine computes a new representation of the information need, which should be better than the initial query.
Search engine runs new query and returns new results.
New results have (hopefully) better recall.
-
Rocchio illustrated
-
Take-away today
Feature selection for text classification: How to select a subset of available dimensions
Vector space classification: Basic idea of doing text classification for documents that are represented as vectors
Rocchio classifier: Rocchio relevance feedback idea applied to text classification
k nearest neighbor classification
Linear classifiers
More than two classes
-
Feature selection
In text classification, we usually represent documents in a high-dimensional space, with each dimension corresponding to a term.
In this lecture: axis = dimension = word = term = feature
Many dimensions correspond to rare words.
Rare words can mislead the classifier.
Rare misleading features are called noise features.
Eliminating noise features from the representation increases efficiency and effectiveness of text classification.
-
Example for a noise feature
Let's say we're doing text classification for the class China.
Suppose a rare term, say ARACHNOCENTRIC, has no information about China . . .
. . . but all instances of ARACHNOCENTRIC happen to occur in China documents in our training set.
Then we may learn a classifier that incorrectly interprets ARACHNOCENTRIC as evidence for the class China.
Such an incorrect generalization from an accidental property of the training set is called overfitting.
-
Basic feature selection algorithm
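The algorithm on this slide did not survive extraction. A minimal sketch of the usual scheme (score every term with a utility measure, keep the k best), assuming documents are given as (set of terms, class label) pairs; `select_features`, `frequency_utility` and the toy documents are illustrative, not taken from the lecture:

```python
from typing import Callable

Doc = tuple[set[str], str]  # (terms in the document, class label); assumed input format


def select_features(docs: list[Doc], c: str, k: int,
                    utility: Callable[[list[Doc], str, str], float]) -> list[str]:
    """Return the k terms with the highest utility A(t, c)."""
    vocabulary = set().union(*(terms for terms, _ in docs))
    scored = [(utility(docs, t, c), t) for t in vocabulary]
    # Sort by descending utility; break ties alphabetically for reproducibility.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]


def frequency_utility(docs: list[Doc], t: str, c: str) -> float:
    """Frequency-based utility: number of documents of class c that contain t."""
    return float(sum(1 for terms, label in docs if label == c and t in terms))


# Made-up toy collection.
docs = [
    ({"beijing", "china", "trade"}, "China"),
    ({"china", "export"}, "China"),
    ({"london", "trade"}, "UK"),
]
top = select_features(docs, "China", 2, frequency_utility)
```

Any other utility measure with the same signature can be passed in place of `frequency_utility`.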
-
Different feature selection methods
A feature selection method is mainly defined by the feature utility measure it employs
Feature utility measures:
Frequency: select the most frequent terms
Mutual information: select the terms with the highest mutual information
Mutual information is also called information gain in this context.
Chi-square (see book)
-
Mutual information
Compute the feature utility A(t, c) as the expected mutual information (MI) of term t and class c.
MI tells us "how much information" the term contains about the class and vice versa.
For example, if a term's occurrence is independent of the class (same proportion of docs within/without class contain the term), then MI is 0.
Definition:
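The formula itself is missing from the transcript. Assuming the standard definition of expected mutual information, with U a random variable for the term (e_t = 1 if the document contains t, 0 otherwise) and C a random variable for the class (e_c = 1 if the document is in c, 0 otherwise), it reads:

```latex
I(U; C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}}
          P(U = e_t, C = e_c) \,
          \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t)\, P(C = e_c)}
```

If term and class are independent, every fraction inside the logarithm equals 1, so every summand is 0.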
-
How to compute MI values
Based on maximum likelihood estimates, the formula we actually use is:
N10: number of documents that contain t (et = 1) and are not in c (ec = 0)
N11: number of documents that contain t (et = 1) and are in c (ec = 1)
N01: number of documents that do not contain t (et = 0) and are in c (ec = 1)
N00: number of documents that do not contain t (et = 0) and are not in c (ec = 0)
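The formula this slide refers to is not in the transcript. A minimal sketch, assuming the standard maximum likelihood estimate of mutual information, in which each probability is replaced by a ratio of document counts (for example P(U = 1, C = 1) is estimated as N11/N); the function name and the example counts are made up for illustration:

```python
import math


def mutual_information(n11: int, n10: int, n01: int, n00: int) -> float:
    """MI (in bits) of term and class from the 2x2 contingency table.

    n11: docs that contain t and are in c      n10: contain t, not in c
    n01: docs without t that are in c          n00: without t, not in c
    """
    n = n11 + n10 + n01 + n00
    n1_ = n11 + n10  # docs containing t
    n0_ = n01 + n00  # docs not containing t
    n_1 = n11 + n01  # docs in c
    n_0 = n10 + n00  # docs not in c

    def cell(n_tc: int, n_t: int, n_c: int) -> float:
        # By convention 0 * log(0) = 0, so empty cells contribute nothing.
        if n_tc == 0:
            return 0.0
        return (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))

    return (cell(n11, n1_, n_1) + cell(n01, n0_, n_1)
            + cell(n10, n1_, n_0) + cell(n00, n0_, n_0))


independent = mutual_information(10, 30, 20, 60)  # same proportion inside and outside c
dependent = mutual_information(50, 0, 0, 50)      # term occurs exactly in the docs of c
```

In the first table one third of the documents contain the term both inside and outside the class, so the MI is 0; in the second the term determines the class completely, which gives 1 bit.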
-
MI example for poultry/EXPORT in Reuters
-
MI feature selection on Reuters
-
Naive Bayes: Effect of feature selection
(multinomial = multinomial Naive Bayes, binomial = Bernoulli Naive Bayes)
-
Feature selection for Naive Bayes
In general, feature selection is necessary for Naive Bayes to get decent performance.
Also true for most other learning methods in text classification: you need feature selection for optimal performance.
-
Exercise
(i) Compute the "export"/POULTRY contingency table for the "Kyoto"/JAPAN in the collection given below.
(ii) Make up a contingency table for which MI is 0, that is, term and class are independent of each other.
"export"/POULTRY table:
-
Recall vector space representation
Each document is a vector, one component for each term.
Terms are axes.
High dimensionality: 100,000s of dimensions
Normalize vectors (documents) to unit length
How can we do classification in this space?
-
Vector space classification
As before, the training set is a set of documents, each labeled with its class.
In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.
Premise 1: Documents in the same class form a contiguous region.
Premise 2: Documents from different classes don't overlap.
We define lines, surfaces, hypersurfaces to divide regions.
-
Classes in the vector space
Should the document ⋆ be assigned to China, UK or Kenya?
Find separators between the classes
Based on these separators: ⋆ should be assigned to China
How do we find separators that
-
Aside: 2D/3D graphs can be misleading
Left: projection of the 2D semicircle to 1D. For the points x1, x2, x3, x4, x5 at x coordinates −0.9, −0.2, 0, 0.2, 0.9 the distance |x2x3| ≈ 0.201 only differs by 0.5% from |x′2x′3| = 0.2; but |x1x3|/|x′1x′3| = dtrue/dprojected ≈ 1.06/0.9 ≈ 1.18 is an example of a large distortion (18%) when projecting a large area.
Right: The
8/20/2019 14vcat
24/67
Retrieval
Outline !ecap
" #eature selection
$ Intro vector space classifcation
% !occhio
& '(() Linear classifers
*
+ two classes ,4
R i l
-
Relevance feedback
In relevance feedback, the user marks documents as relevant/nonrelevant.
Relevant/nonrelevant can be viewed as classes or categories.
For each document, the user decides which of these two classes is correct.
The IR system then uses these class assignments to build a better query ("model") of the information need . . .
. . . and returns better documents.
Relevance feedback is a form of text classification.
-
Using Rocchio for vector space classification
The principal difference between relevance feedback and text classification:
The training set is given as part of the input in text classification.
It is interactively created in relevance feedback.
-
Rocchio classification: Basic idea
Compute a centroid for each class
The centroid is the average of all documents in the class.
Assign each test document to the class of its closest centroid.
-
Recall definition of centroid
$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$
where $D_c$ is the set of all documents that belong to class c and $\vec{v}(d)$ is the vector space representation of d.
-
Rocchio algorithm
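The pseudocode on this slide is missing from the transcript. A minimal sketch of the idea described on the preceding slides (one centroid per class, assign each test document to the closest centroid), assuming documents are dense lists of floats and distance is Euclidean; `train_rocchio`, `apply_rocchio` and the toy data are illustrative:

```python
import math
from collections import defaultdict

Vector = list[float]


def train_rocchio(docs: list[tuple[Vector, str]]) -> dict[str, Vector]:
    """Compute one centroid (component-wise mean) per class."""
    by_class: dict[str, list[Vector]] = defaultdict(list)
    for vec, label in docs:
        by_class[label].append(vec)
    return {
        label: [sum(component) / len(vecs) for component in zip(*vecs)]
        for label, vecs in by_class.items()
    }


def apply_rocchio(centroids: dict[str, Vector], d: Vector) -> str:
    """Assign d to the class whose centroid is closest."""
    return min(centroids, key=lambda label: math.dist(centroids[label], d))


# Made-up training set with two classes.
training = [
    ([1.0, 0.0], "China"), ([0.8, 0.2], "China"),
    ([0.0, 1.0], "UK"), ([0.2, 0.8], "UK"),
]
centroids = train_rocchio(training)
label = apply_rocchio(centroids, [0.7, 0.3])
```

Training is a single pass over the documents; classification only compares the test document with one vector per class.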
-
Rocchio illustrated: a1 = a2, b1 = b2, c1 = c2
-
Rocchio properties
Rocchio forms a simple representation for each class: the centroid
We can interpret the centroid as the prototype of the class.
Classification is based on similarity to / distance from centroid/prototype.
Does not guarantee that classifications are consistent with the training data!
-
Time complexity of Rocchio
-
Rocchio vs. Naive Bayes
In many cases, Rocchio performs worse than Naive Bayes.
One reason: Rocchio does not handle nonconvex, multimodal classes correctly.
-
Rocchio cannot handle nonconvex, multimodal classes
Exercise: Why is Rocchio not expected to do well for the classification task a vs. b here?
A is centroid of the a's, B is centroid of the b's.
The point o is closer to A than to B.
But o is a better fit for the b class.
A is a multimodal class with two prototypes.
[Figure: a and b documents with centroids A and B and test point o]
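The exercise can be reproduced with made-up coordinates. In this sketch (all points are hypothetical, not read off the slide's figure), class a consists of two clusters on either side of class b, so the centroid of a falls between its clusters, right next to the b documents:

```python
import math

Point = tuple[float, float]

# Hypothetical data: class a has two clusters, class b lies between them.
a_points: list[Point] = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
b_points: list[Point] = [(4.5, 2.0), (5.0, 2.0), (5.5, 2.0)]
o: Point = (5.0, 0.9)  # test point near the b documents


def centroid(points: list[Point]) -> Point:
    """Component-wise mean of a list of 2D points."""
    return (sum(x for x, _ in points) / len(points),
            sum(y for _, y in points) / len(points))


centroid_a, centroid_b = centroid(a_points), centroid(b_points)

# Rocchio: compare distances to the two centroids.
rocchio_label = "a" if math.dist(o, centroid_a) < math.dist(o, centroid_b) else "b"

# Nearest neighbour: compare distances to the actual training points.
nearest_a = min(math.dist(o, p) for p in a_points)
nearest_b = min(math.dist(o, p) for p in b_points)
nn_label = "a" if nearest_a < nearest_b else "b"
```

Rocchio assigns o to a because the centroid of a is slightly closer, although every training document near o belongs to b.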
-
kNN classification
kNN classification is another vector space classification method.
It also is very simple and easy to implement.
kNN is more accurate (in most cases) than Naive Bayes and Rocchio.
If you need to get a pretty accurate classifier up and running in a short time . . .
. . . and you don't care about efficiency that much . . .
. . . use kNN.
-
kNN classification
kNN = k nearest neighbors
kNN classification rule for k = 1 (1NN): Assign each test document to the class of its nearest neighbor in the training set.
1NN is not very robust: one document can be mislabeled or atypical.
kNN classification rule for k > 1 (kNN): Assign each test document to the majority class of its k nearest neighbors in the training set.
Rationale of kNN: contiguity hypothesis
We expect a test document d to have the same label as the training documents located in the local region surrounding d.
-
Probabilistic kNN
Probabilistic version of kNN: P(c|d) = fraction of k neighbors of d that are in c
kNN classification rule for probabilistic kNN: Assign d to class c with highest P(c|d)
-
Probabilistic kNN
1NN, 3NN classification decision for star?
-
kNN algorithm
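The algorithm on this slide is missing from the transcript. A minimal sketch of kNN with the probabilistic estimate P(c|d) from the earlier slide, assuming dense vectors, Euclidean distance and a brute-force scan of the training set (all names and data are illustrative):

```python
import math
from collections import Counter

Vector = list[float]


def knn_probabilities(training: list[tuple[Vector, str]], d: Vector, k: int) -> dict[str, float]:
    """P(c|d) = fraction of the k nearest training documents that are in c."""
    neighbours = sorted(training, key=lambda item: math.dist(item[0], d))[:k]
    counts = Counter(label for _, label in neighbours)
    return {label: n / len(neighbours) for label, n in counts.items()}


def knn_classify(training: list[tuple[Vector, str]], d: Vector, k: int) -> str:
    """Assign d to the class with the highest P(c|d)."""
    probs = knn_probabilities(training, d, k)
    return max(probs, key=probs.get)


# Made-up training set: three "x" documents and one atypical "o" document.
training = [
    ([0.0, 0.0], "x"), ([0.0, 1.0], "x"), ([1.0, 0.0], "x"),
    ([0.4, 0.4], "o"),
]
star = [0.5, 0.5]
label_1nn = knn_classify(training, star, k=1)
label_3nn = knn_classify(training, star, k=3)
```

With k = 1 the single atypical document decides the label; with k = 3 the majority of the neighbourhood wins. Each call scans the whole training set, so test time grows with the size of the training set.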
-
Exercise
How is star classified by:
(i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio?
-
Time complexity of kNN
kNN with preprocessing of training set: training, testing
kNN test time proportional to the size of the training set!
The larger the training set, the longer it takes to classify a test document.
kNN is inefficient for very large training sets.
-
kNN: Discussion
No training necessary
But linear preprocessing of documents is as expensive as training Naive Bayes.
We always preprocess the training set, so in reality training time of kNN is linear.
kNN is very accurate if training set is large.
Optimality result: asymptotically zero error if Bayes rate is zero.
But kNN can be very inaccurate if training set is small.
-
Linear classifiers
Definition: A linear classifier computes a linear combination or weighted sum $\sum_i w_i d_i$ of the feature values.
Classification decision: $\sum_i w_i d_i \geq \theta$?
. . . where $\theta$ (the threshold) is a parameter.
(First, we only consider binary classifiers.)
Geometrically, this corresponds to a line (2D), a plane (3D) or a hyperplane (higher dimensionalities), the separator.
We find this separator based on training set.
Methods for finding separator: Perceptron, Rocchio, Naive Bayes, as we will explain on the next slides
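The decision rule fits in a few lines. A sketch assuming a dense weight vector, a dense feature vector and a threshold, with an illustrative function name and made-up values:

```python
def linear_classify(w: list[float], d: list[float], theta: float) -> bool:
    """Return True if d is assigned to class c, i.e. if sum_i w_i * d_i >= theta."""
    score = sum(w_i * d_i for w_i, d_i in zip(w, d))
    return score >= theta


# Hypothetical 2D classifier: the separator is the line 1.0*d1 + 2.0*d2 = 2.0.
w, theta = [1.0, 2.0], 2.0
in_class = linear_classify(w, [2.0, 1.0], theta)      # score 4.0
not_in_class = linear_classify(w, [0.5, 0.5], theta)  # score 1.5
```

Whatever method is used to find the separator, applying it only requires the weight vector and the threshold.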
-
A linear classifier in 1D
A linear classifier in 1D is a point described by the equation $w_1 d_1 = \theta$
The point at $\theta / w_1$
Points ($d_1$) with $w_1 d_1 \geq \theta$ are in the class c.
Points ($d_1$) with $w_1 d_1 < \theta$ are in the complement class.
-
A linear classifier in 2D
A linear classifier in 2D is a line described by the equation $w_1 d_1 + w_2 d_2 = \theta$
Example for a 2D linear classifier
Points ($d_1$ $d_2$) with $w_1 d_1 + w_2 d_2 \geq \theta$ are in the class c.
Points ($d_1$ $d_2$) with $w_1 d_1 + w_2 d_2 < \theta$ are in the complement class.
-
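A linear classifier in 3D
A linear classifier in 3D is a plane described by the equation $w_1 d_1 + w_2 d_2 + w_3 d_3 = \theta$
Example for a 3D linear classifier
Points ($d_1$ $d_2$ $d_3$) with $w_1 d_1 + w_2 d_2 + w_3 d_3 \geq \theta$ are in the class c.
Points ($d_1$ $d_2$ $d_3$) with $w_1 d_1 + w_2 d_2 + w_3 d_3 < \theta$ are in the complement class.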
-
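Rocchio as a linear classifier
Rocchio is a linear classifier defined by:
$\sum_{i=1}^{M} w_i d_i = \vec{w} \cdot \vec{d} = \theta$
where $\vec{w}$ is the normal vector $\vec{\mu}(c_1) - \vec{\mu}(c_2)$ and $\theta = 0.5 \cdot (|\vec{\mu}(c_1)|^2 - |\vec{\mu}(c_2)|^2)$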
-
Naive Bayes as a linear classifier
Multinomial Naive Bayes is a linear classifier (in log space) defined by:
$\sum_{i=1}^{M} w_i d_i = \theta$
where $w_i = \log[\hat{P}(t_i|c) / \hat{P}(t_i|\bar{c})]$, $d_i$ = number of occurrences of $t_i$ in d, and $\theta = -\log[\hat{P}(c) / \hat{P}(\bar{c})]$. Here, the index i, $1 \leq i \leq M$, refers to terms of the vocabulary (not to positions in d as k did in our original definition of Naive Bayes)
-
kNN is not a linear classifier
Classification decision based on majority of k nearest neighbors.
The decision boundaries between classes are piecewise linear . . .
. . . but they are in general not linear classifiers that can be described as $\sum_{i=1}^{M} w_i d_i = \theta$
-
Example of a linear two-class classifier
This is for the class interest in Reuters-21578.
For simplicity: assume a simple 0/1 vector representation
d1: "rate discount dlrs world"
d2: "prime dlrs"
Exercise: Which class is d1 assigned to? Which class is d2 assigned to?
We assign document "rate discount dlrs world" to interest since its weighted sum $\vec{w}^T \vec{d}_1$ exceeds the threshold $\theta$.
-
Which hyperplane?
-
Learning algorithms for vector space classification
In terms of actual computation, there are two types of learning algorithms.
(i) Simple learning algorithms that estimate the parameters of the classifier directly from the training data, often in one linear pass.
Naive Bayes, Rocchio, kNN are all examples of this.
(ii) Iterative algorithms
Support vector machines
Perceptron (example available as PDF on website: http://ifnlp.org/ir/pdf/p.pdf)
The best performing learning algorithms usually require iterative learning.
-
Which hyperplane?
-
Which hyperplane?
For linearly separable training sets: there are infinitely many separating hyperplanes.
They all separate the training set perfectly . . .
. . . but they behave differently on test data.
Error rates on new data are low for some, high for others.
How do we find a low-error separator?
Perceptron: generally bad; Naive Bayes, Rocchio: ok; linear SVM: good
-
Linear classifiers: Discussion
Many common text classifiers are linear classifiers: Naive Bayes, Rocchio, logistic regression, linear support vector machines etc.
Each method has a different way of selecting the separating hyperplane
Huge differences in performance on test documents
Can we get better performance with more powerful nonlinear classifiers?
Not in general: A given amount of training data may suffice for estimating a linear boundary, but not for estimating a more complex nonlinear boundary.
-
A nonlinear problem
Linear classifier like Rocchio does badly on this task.
kNN will do well (assuming enough training data)
-
Which classifier do I use for a given TC problem?
Is there a learning method that is optimal for all text classification problems?
No, because there is a tradeoff between bias and variance.
Factors to take into account:
How much training data is available?
How simple/complex is the problem? (linear vs. nonlinear decision boundary)
How noisy is the problem?
How stable is the problem over time?
For an unstable problem, it's better to use a simple and robust classifier.
-
Classes are mutually exclusive.
Each document belongs to exactly one class.
Example: language of a document (assumption: no document contains multiple languages)
-
One-of classification with linear classifiers
-
Example: topic classification
Usually: make decisions on the region, on the subject area, on the industry and so on "independently"
-
Any-of classification with linear classifiers
-
text classification
k nearest neighbor classification
Linear classifiers
More than two classes
-
Resources
Chapter 13 of IIR (feature selection)
Chapter 14 of IIR
Resources at http://ifnlp.org/ir
Perceptron example
General overview of text classification: Sebastiani (2002)
Text classification chapter on decision trees and perceptrons: Manning & Schütze (1999)
One of the best machine learning textbooks: Hastie, Tibshirani & Friedman (2003)