structural text features - university of...
TRANSCRIPT
4/7/09
1
Structural Text Features
CISC489/689‐010, Lecture #13
Monday, April 6th
Ben Carterette
Structural Features
• So far we have mainly focused on “vanilla” features of terms in documents
  – Term frequency, document frequency
  – “Bag of words” models
• Some documents have structure that we could leverage for improved retrieval
  – Natural language has structure as well
• We can derive features from this structure, especially from the placement of terms within structure or placement of terms with respect to each other
Example: HTML
• “HyperText Markup Language”
• Provides document structure using tags enclosing text
  – <title>: enclosed text displayed at top of browser
  – <body>: enclosed text displayed in browser
  – <h1>: enclosed text displayed in large font
  – <b>: enclosed text displayed in bold
  – <a>: enclosed text can be clicked to go to another page
• The text enclosed in fields is often unstructured or structured with more HTML
Example: HTML
Example: HTML
• HTML pages organize into trees.
<HTML>
  <HEAD>
    <TITLE> Tropical fish
    <META>
  <BODY>
    <H1> Tropical fish
    <P>
      <B> Tropical fish
      <A> fish
      <A> tropical
      include found in environments around the world
Nodes contain blocks of text.
Example: Email
• Header fields provide some structure
Structure in Natural Language
• One example: parse trees
(from http://www.lancs.ac.uk/fss/courses/ling/corpus/Corpus2/2PARSE.HTM)
Hyper‐Structure
• The documents themselves may occur within some structure
  – The web: documents link to each other, creating a graph structure
  – Email: threaded conversations
  – Sentences form paragraphs, paragraphs form sections, sections form chapters, chapters form books, …
• This structure may provide useful features
Using Structural Features in Retrieval
• Steps:
  – Derive features – document processing
  – Index features – using inverted lists
  – Retrieval using features – retrieval models, scoring functions, query languages
Specific Features
• Phrases:
  – Sequences of words in order
  – Users want to query phrases, e.g. “tropical fish”
• Fields and tags:
  – Markup enclosing parts of documents
  – We want to emphasize some parts, de‐emphasize others. E.g. titles important, sidebars not
• Web hyper‐structure:
  – Links between pages
  – We want pages that are frequently linked using the same text to score higher for queries that contain that text
• What are the features, how do we derive them, how do we store them, and how do we model them in retrieval?
Deriving and Indexing Features
• Derivation considerations:
  – Computational time and space requirements
  – Errors in processing
  – Use in queries
• Indexing considerations:
  – Fast query processing
  – Flexibility (index once with all info for calculating anything you can imagine vs. re‐index every time you come up with a new idea)
  – Storage
Phrases
• Many queries are 2‐3 word phrases
• Phrases are
  – More precise than single words
    • e.g., documents containing “black sea” vs. two words “black” and “sea”
  – Less ambiguous
    • e.g., “big apple” vs. “apple”
• Can be difficult for ranking
  – e.g., given query “fishing supplies”, how do we score documents with exact phrase many times, exact phrase just once, individual words in same sentence, same paragraph, whole document, variations on words?
Phrases
• Text processing issue – how are phrases recognized?
• Three possible approaches:
  – Identify syntactic phrases using a part‐of‐speech (POS) tagger
  – Use word n‐grams
  – Store word positions in indexes and use proximity operators in queries
POS Tagging
• POS taggers use statistical models of text to predict syntactic tags of words
  – Example tags:
    • NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., “and”, “or”), PRP (pronoun), and MD (modal auxiliary, e.g., “can”, “will”).
• Phrases can then be defined as simple noun groups, for example
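
One way to picture "simple noun groups" is as runs of adjectives and nouns that end in a noun. The sketch below assumes the text has already been tagged; the tags in `tagged` are assigned by hand for illustration, and the rule used here (at least two words, ending in NN or NNS) is an assumption rather than the definition used in the lecture.

```python
NOUN_TAGS = {"NN", "NNS"}

def noun_groups(tagged):
    """Return maximal adjective/noun runs of 2+ words that end in a noun."""
    phrases, run = [], []
    # The sentinel token forces the final run to be flushed.
    for word, tag in list(tagged) + [("", "END")]:
        if tag == "JJ" or tag in NOUN_TAGS:
            run.append((word, tag))
            continue
        # The run has ended: trim trailing adjectives so it ends in a noun.
        while run and run[-1][1] == "JJ":
            run.pop()
        if len(run) >= 2:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases

# Hand-tagged example sentence (tags are illustrative, not tagger output).
tagged = [("tropical", "JJ"), ("fish", "NN"), ("include", "VB"),
          ("fish", "NN"), ("found", "VBN"), ("in", "IN"),
          ("tropical", "JJ"), ("environments", "NNS")]
print(noun_groups(tagged))  # ['tropical fish', 'tropical environments']
```

A real system would obtain the tags from a statistical tagger, which is where the indexing cost and the tagging errors mentioned in these slides come from.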
POS Tagging Example
Example Noun Phrases
Noun Phrase Inverted Lists
Q = “united states”: retrieve inverted list for phrase “united states” and process
Q = united states: retrieve inverted lists for terms “united”, “states” and process
Word N‐Grams
• POS tagging too slow for large collections
• Simpler definition – phrase is any sequence of n words – known as n‐grams
  – bigram: 2 word sequence, trigram: 3 word sequence, unigram: single words
  – N‐grams also used at character level for applications such as OCR
• N‐grams typically formed from overlapping sequences of words
  – i.e. move n‐word “window” one word at a time in document
Word Bigrams
Tropical fish, fish include, include fish, fish found, found in, in tropical, tropical environments, environments around, around the, the world, …
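
The sliding window can be sketched in a few lines. The function name `word_ngrams` and the whitespace tokenization are assumptions made for this illustration.

```python
def word_ngrams(tokens, n):
    """Slide an n-word window over the tokens, one word at a time."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "tropical fish include fish found in tropical environments around the world"
bigrams = word_ngrams(text.split(), 2)
print(bigrams[:3])  # ['tropical fish', 'fish include', 'include fish']
```

Each bigram would then get its own inverted list, exactly as a single term does.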
Bigram Inverted Lists
Though many unusual phrases are included, term statistics help ensure that they do not hurt retrieval
N‐Grams
• Frequent n‐grams are more likely to be meaningful phrases
• N‐grams form a Zipf distribution
  – Better fit than words alone
• Could index all n‐grams up to specified length
  – Much faster than POS tagging
  – Uses a lot of storage
    • e.g., document containing 1,000 words would contain 3,990 instances of word n‐grams of length 2 ≤ n ≤ 5
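
The 3,990 figure follows from counting windows: a document of N words contains N − n + 1 overlapping n‐grams of length n. A quick check (the function name is made up for this sketch):

```python
def ngram_instances(num_words, min_n, max_n):
    """Count overlapping word n-gram instances for min_n <= n <= max_n."""
    return sum(max(num_words - n + 1, 0) for n in range(min_n, max_n + 1))

# 999 bigrams + 998 trigrams + 997 4-grams + 996 5-grams
print(ngram_instances(1000, 2, 5))  # 3990
```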
Google N‐Grams
• Web search engines index n‐grams
• Google sample:
• Most frequent trigram in English is “all rights reserved”
  – In Chinese, “limited liability corporation”
Use Term Positions
• Rather than store phrases in index directly, store term positions and locate phrases at query time
• Match phrases or words within a window
  – e.g., "tropical fish", or “find tropical within 5 words of fish”
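
A minimal sketch of both query types over a positional index. The index layout (term → document → list of positions), the function names, and the three toy documents are assumptions for illustration; a real engine would merge sorted position lists rather than compare every pair.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with 0-based word positions."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def within_window(index, first, second, window, ordered=False):
    """Doc ids where `second` occurs within `window` words of `first`.

    With ordered=True, `second` must come after `first`.
    """
    hits = set()
    common = index.get(first, {}).keys() & index.get(second, {}).keys()
    for doc_id in common:
        for p in index[first][doc_id]:
            for q in index[second][doc_id]:
                distance = (q - p) if ordered else abs(q - p)
                if 0 < distance <= window:
                    hits.add(doc_id)
    return hits

docs = {
    1: "Tropical fish include fish found in tropical environments",
    2: "fish from tropical waters",
    3: "tropical storms rarely harm deep sea fish",
}
index = build_positional_index(docs)

# Exact phrase "tropical fish": fish directly follows tropical.
phrase_hits = within_window(index, "tropical", "fish", 1, ordered=True)
# "find tropical within 5 words of fish": either order.
window_hits = within_window(index, "tropical", "fish", 5)
print(sorted(phrase_hits), sorted(window_hits))  # [1] [1, 2]
```

The same index answers both queries, which is the flexibility that storing positions buys; the price is that phrase statistics have to be computed at query time.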
Phrase Method Tradeoffs
• POS tagging:
  – Very long index time, possible errors, medium storage requirement, not very flexible
  – Fast phrase‐query processing
• N‐Grams:
  – High storage requirement
  – More flexible, fast phrase‐query processing
• Term positions:
  – Medium‐low storage requirement, very flexible
  – Possibly slower query processing due to needing to calculate collection statistics
Parsing
• Basic parsing: identify which parts of documents to index, which to ignore
• Full parsing: identify and label parts of documents, maintain structure, decide which parts are relatively more important
HTML Parsing
• An HTML parser produces a DOM tree
• We want to store basic term information (tf, idf) as well as information about the nodes the term appears in
<HTML>
  <HEAD>
    <TITLE> Tropical fish
    <META>
  <BODY>
    <H1> Tropical fish
    <P>
      <B> Tropical fish
      <A> fish
      <A> tropical
      include found in environments around the world
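
As a rough sketch of recording which nodes each term appears in, the code below uses Python's standard `html.parser` module. It does not build a full DOM tree; it only tracks the stack of open tags. The class name `FieldCollector` and the sample markup (a guess at the page behind the tree above) are assumptions.

```python
from html.parser import HTMLParser

class FieldCollector(HTMLParser):
    """Record each term with its position and the tags that enclose it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []     # stack of currently open tags
        self.occurrences = []   # (term, position, enclosing tags)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":       # void element: it never encloses text
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to (and including) the matching open tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        for term in data.lower().split():
            position = len(self.occurrences)
            self.occurrences.append((term, position, tuple(self.open_tags)))

html_doc = (
    "<html><head><title>Tropical fish</title><meta name='x' content='y'></head>"
    "<body><h1>Tropical fish</h1>"
    "<p><b>Tropical fish</b> include <a href='f.html'>fish</a> found in "
    "<a href='t.html'>tropical</a> environments around the world</p>"
    "</body></html>"
)
collector = FieldCollector()
collector.feed(html_doc)
print(collector.occurrences[0])  # ('tropical', 0, ('html', 'head', 'title'))
```

From these (term, position, tags) triples an indexer can compute both the usual term statistics and per-field counts.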
Indexing Fields
• After parsing we have:
  – <title>: tropical fish
  – <body>: tropical fish tropical fish include fish found in tropical environments around the world…
  – <h1>: tropical fish
  – <b>: tropical fish
  – <a>: fish
  – <a>: tropical
• Ideas for indexing:
  – Store field information in inverted list.
  – Add new inverted lists for fields.
  – Use extents to keep track of fields in documents.
Field Information in Inverted Lists
• Creating the term inverted list:
  – For each document the term appears in,
    • For each field the term appears in in that document,
      – Store the term frequency within the field
• Also store the “field frequency”
  – i.e. total number of times the term appears in each field through the collection
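
The nested loops above might be sketched as follows. The input format (document id → field name → text), the function name, and the two toy documents are assumptions; for simplicity each field's text is counted separately, so heading text is not repeated inside the body field.

```python
from collections import defaultdict

def build_field_index(docs):
    """Return (postings, field_freq).

    postings:   term -> doc_id -> field -> term frequency within that field
    field_freq: term -> field -> total occurrences across the collection
    """
    postings = defaultdict(lambda: defaultdict(dict))
    field_freq = defaultdict(lambda: defaultdict(int))
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in text.lower().split():
                per_doc = postings[term][doc_id]
                per_doc[field] = per_doc.get(field, 0) + 1
                field_freq[term][field] += 1
    return postings, field_freq

docs = {
    1: {"title": "tropical fish",
        "h1": "tropical fish",
        "body": "tropical fish include fish found in tropical environments"},
    2: {"title": "fish tanks",
        "body": "tanks for tropical fish"},
}
postings, field_freq = build_field_index(docs)
print(postings["tropical"][1])     # {'title': 1, 'h1': 1, 'body': 2}
print(dict(field_freq["fish"]))    # {'title': 2, 'h1': 1, 'body': 3}
```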
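Field Information in Inverted List
Example
[Figure: inverted list for one term. Header entries: document freq, <title> freq, <body> freq, <h1> freq. Entries for doc 1: tf in doc 1, tf in <title> in doc 1, tf in <body> in doc 1, tf in <h1> in doc 1.]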
Add New Inverted Lists
• Instead of storing all field information in one list, create a new list for each field the term appears in
• Adds K new inverted lists, where K = the total number of fields the term appears in.
Example
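
A minimal sketch of this layout, assuming lists are keyed by a (field, term) pair and hold per-document counts. The key scheme, function name, and toy documents are illustrative assumptions.

```python
from collections import defaultdict

def build_per_field_lists(docs):
    """One inverted list per (field, term) pair: (field, term) -> {doc_id: tf}."""
    lists = defaultdict(lambda: defaultdict(int))
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in text.lower().split():
                lists[(field, term)][doc_id] += 1
    return lists

def extra_lists_for(lists, term):
    """K: the number of distinct fields (and so lists) the term appears in."""
    return sum(1 for (_field, t) in lists if t == term)

docs = {
    1: {"title": "tropical fish",
        "h1": "tropical fish",
        "body": "tropical fish include fish found in tropical environments"},
    2: {"title": "fish tanks",
        "body": "tanks for tropical fish"},
}
field_lists = build_per_field_lists(docs)
print(dict(field_lists[("body", "tropical")]))   # {1: 2, 2: 1}
print(extra_lists_for(field_lists, "tropical"))  # 3
```

A query restricted to titles only touches the title list for each term, which is why this layout processes field queries quickly at the cost of extra storage.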
Extents
• An extent is a contiguous region in a document
• Defined by a starting term position and an ending term position
  – e.g., extent from position 8 through position 36
Using Extents to Store Fields
• Store term positions in term inverted lists
• Define an extent inverted list for each field
• Include the document number and range of positions the extent includes
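
With extents, a field-restricted term frequency is computed at query time by checking which term positions fall inside the field's extents. The sketch below assumes 1-based positions, inclusive extent ends, and hand-built lists for the tropical fish page; all names are hypothetical.

```python
# Term inverted lists: term -> {doc_id: [positions]}
term_positions = {
    "tropical": {1: [1, 3, 5, 11]},
    "fish": {1: [2, 4, 6, 8]},
}
# Extent inverted lists: field -> {doc_id: [(start, end), ...]}
extent_lists = {
    "title": {1: [(1, 2)]},
    "h1": {1: [(3, 4)]},
    "b": {1: [(5, 6)]},
    "a": {1: [(8, 8), (11, 11)]},
    "body": {1: [(3, 15)]},
}

def field_tf(term, field, doc_id, term_positions, extent_lists):
    """Count occurrences of `term` that lie inside any extent of `field`."""
    positions = term_positions.get(term, {}).get(doc_id, [])
    extents = extent_lists.get(field, {}).get(doc_id, [])
    return sum(
        any(start <= p <= end for start, end in extents) for p in positions
    )

print(field_tf("tropical", "title", 1, term_positions, extent_lists))  # 1
print(field_tf("tropical", "body", 1, term_positions, extent_lists))   # 3
```

Nested fields come for free: the <h1> extent sits inside the <body> extent, so position 3 counts toward both.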
Field Storage Tradeoffs
• Include field info in inverted lists:
  – Storage efficient, fairly inflexible, fairly slow processing
• New lists for terms in fields:
  – Storage inefficient, more flexible, faster processing
• Field extents:
  – Storage efficient, very flexible, fairly fast processing
Anchor Text
• Anchor text is text on another page used to link to a document
• Can indicate what other people think the document is about
• Can be taken as a short summary of the document's contents
Anchor Text Example
Indexing Anchor Text
• Simple solution:
  – Include anchor text as part of document text
  – “Tropical” term frequency = # of times it appears in the document + # of times it appears in anchor text in documents linking to it
• Slightly more complex solution:
  – Include anchor text in fields, e.g. <anchor>
  – One field for each link to the document
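
Both solutions can be sketched over a small link table. The page ids, the link triples, and the function names are made up for this example.

```python
from collections import Counter

pages = {
    "fish_page": "tropical fish include fish found in tropical environments",
}
# (source page, target page, anchor text)
links = [
    ("page_a", "fish_page", "tropical fish"),
    ("page_b", "fish_page", "guide to tropical fish"),
]

def tf_with_anchor_text(page_id, pages, links):
    """Simple solution: count anchor text as if it were part of the document."""
    tf = Counter(pages[page_id].lower().split())
    for _source, target, anchor in links:
        if target == page_id:
            tf.update(anchor.lower().split())
    return tf

def anchor_fields(page_id, links):
    """Field solution: one <anchor> field (a term list) per incoming link."""
    return [anchor.lower().split()
            for _source, target, anchor in links if target == page_id]

tf = tf_with_anchor_text("fish_page", pages, links)
print(tf["tropical"])                  # 4: 2 in the document + 2 in anchors
print(anchor_fields("fish_page", links))
```

Keeping anchors as separate fields lets a retrieval model weight them differently from body text, which the simple solution cannot do.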
Inverted Lists at Google
• As of 1998, Google stored the following:
  – Whether a term occurrence is “plain” or “fancy”
    • “Fancy” = occurs in URL, title, anchor text, or meta tag.
    • “Plain” = everything else
  – If plain, store:
    • Whether capitalized, font size information, and position information (in 1 bit, 3 bits, and 12 bits respectively)
  – If fancy, store:
    • Whether capitalized, maximum font size, type of hit, and position information (in 1 bit, 3 bits, 4 bits, and 8 bits respectively)
    • And if type = anchor, split 8 position bits into 4 docID bits and 4 position bits
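
Either way a hit fits in 16 bits (1 + 3 + 12 or 1 + 3 + 4 + 8). The packing below is only a sketch: the bit order, the hit-type codes, and the reading of "maximum font size" as the 3-bit font field set to 7 to flag a fancy hit are assumptions, not a description of Google's actual format.

```python
FANCY_FONT = 0b111                  # assumed marker: max 3-bit font size
TYPE_URL, TYPE_TITLE, TYPE_ANCHOR, TYPE_META = 0, 1, 2, 3  # hypothetical codes

def pack_plain(capitalized, font_size, position):
    """1 bit capitalization | 3 bits font size | 12 bits position."""
    assert 0 <= font_size < FANCY_FONT and 0 <= position < (1 << 12)
    return (int(capitalized) << 15) | (font_size << 12) | position

def pack_fancy(capitalized, hit_type, position):
    """1 bit capitalization | 3 bits (all ones) | 4 bits type | 8 bits position."""
    assert 0 <= hit_type < (1 << 4) and 0 <= position < (1 << 8)
    return (int(capitalized) << 15) | (FANCY_FONT << 12) | (hit_type << 8) | position

def pack_anchor(capitalized, doc_bits, position):
    """Anchor hit: the 8 position bits hold 4 docID bits and 4 position bits."""
    assert 0 <= doc_bits < (1 << 4) and 0 <= position < (1 << 4)
    return pack_fancy(capitalized, TYPE_ANCHOR, (doc_bits << 4) | position)

title_hit = pack_fancy(True, TYPE_TITLE, 1)
plain_hit = pack_plain(False, 3, 108)      # font size 3 is an arbitrary choice
anchor_hit = pack_anchor(False, 5, 3)
print(hex(title_hit), hex(plain_hit), hex(anchor_hit))  # 0xf101 0x306c 0x7253
```

The narrow position fields mean large positions cannot be represented exactly; how such cases are handled is outside what the slide states.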
Inverted Lists at Google
• Example: “tropical” occurs 3 times in document
  – Once capitalized in title at position 1
  – Once capitalized in a header at position 4
  – Once in lower‐case in body text at position 108
• Also occurs in 2 other linking documents
• Google inverted list might look like this:
  [Figure: inverted list entry made up of fancy hit 1 (title), fancy hit 2 (header), plain hit, anchor hit 1, anchor hit 2]