TRANSCRIPT
Type-enabled Keyword Searches with Uncertain Schema
Soumen Chakrabarti
IIT Bombay
www.cse.iitb.ac.in/~soumen
ICML 2005 Chakrabarti 2
Evolution of Web search

The first decade of Web search
• Crawling and indexing at massive scale
• Macroscopic whole-page connectivity analysis
• Very limited expression of information need

Exploiting entities and relations: a clear trend
• Maintaining large type systems and ontologies
• Discovering mentions of entities and relations
• Deduplicating and canonicalizing mentions
• Forming uncertain, probabilistic E-R graphs
• Enhancing keyword or schema-aware queries
[Slide: system architecture diagram, with four numbered stages. Raw corpus → named entity tagging, relation tagging, disambiguation → annotated corpus → indexer → text index and annotation index. At query time, a question feeds an answer type predictor and a keyword match predictor; a ranking engine, informed by past query workload stats, returns response snippets. WordNet, Wikipedia, FrameNet, and KnowItAll plug into a uniform lexical network provider.]
Populating entity and relation tables

Hearst patterns (Hearst 1992)
• "T such as x", "x and other T", "x is a T"

DIPRE (Brin 1998), Snowball (Agichtein+ 2000)
• [left] entity1 [middle] entity2 [right]

PMI-IR (Turney 2001)
• Recognize synonyms using Web stats

KnowItAll (Etzioni+ 2004), C-PANKOW (Cimiano+ 2005)
• Is-a relations from Hearst patterns, lists, PMI
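As a minimal sketch of the Hearst-pattern idea (the function name and regexes are illustrative, not from the talk; real extractors use POS and chunk information rather than raw regexes):

```python
import re

# Two of Hearst's (1992) lexico-syntactic patterns for is-a pairs.
SUCH_AS = re.compile(r"(\w+) such as (\w+)")       # "T such as x"
AND_OTHER = re.compile(r"(\w+) and other (\w+)")   # "x and other T"

def hearst_pairs(text):
    """Return (instance, type) candidates matched by the two patterns."""
    pairs = [(x, t) for t, x in SUCH_AS.findall(text)]
    pairs += [(x, t) for x, t in AND_OTHER.findall(text)]
    return pairs
```

For example, "cities such as Karachi" and "Karachi and other cities" both yield the candidate pair (Karachi, cities).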
DIPRE and Snowball

Loop: seed tuples → tag mentions in free text → generate extraction patterns → locate new tuples → augmented table.

Example: "… the Irving-based Exxon Corporation …"
• Irving is tagged location; Exxon Corporation is tagged organization
• Context split into left (ℓ), middle (m), and right (r), each encoded as a bag of words
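The bootstrap loop above can be sketched on toy data; this is a deliberately simplified version (exact-string middles instead of bag-of-words contexts, capitalized-word entity matching, and the helper names are mine):

```python
import re

def generate_patterns(sentences, table):
    """Collect the middle strings seen between known (e1, e2) pairs."""
    patterns = set()
    for s in sentences:
        for e1, e2 in table:
            m = re.search(re.escape(e1) + r"(.+?)" + re.escape(e2), s)
            if m:
                patterns.add(m.group(1))
    return patterns

def locate_tuples(sentences, patterns):
    """Find capitalized-word pairs joined by a known middle string."""
    new = set()
    for s in sentences:
        for p in patterns:
            for m in re.finditer(r"([A-Z]\w+)" + re.escape(p) + r"([A-Z]\w+)", s):
                new.add((m.group(1), m.group(2)))
    return new

def bootstrap(sentences, seeds, rounds=2):
    """DIPRE-style loop: patterns from tuples, tuples from patterns."""
    table = set(seeds)
    for _ in range(rounds):
        table |= locate_tuples(sentences, generate_patterns(sentences, table))
    return table
```

Seeding with ("Exxon", "Irving") against "Exxon is based in Irving." learns the middle " is based in " and then harvests ("Infosys", "Bangalore") from "Infosys is based in Bangalore."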
Scoring patterns and tuples (DIPRE, Snowball)

Pattern confidence = m⁺/(m⁺ + m⁻) over validation tuples

Soft-or tuple confidence:
conf(t) = 1 − ∏_i (1 − conf(p_i) · match(p_i, t))
(uses the 5-part encoding)

Recent improvements: urn model (Etzioni+ 2005)
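The two confidence formulas are simple to compute; a minimal sketch (function names are mine):

```python
def pattern_conf(pos, neg):
    """Pattern confidence = m+ / (m+ + m-) over validation tuples."""
    return pos / (pos + neg)

def tuple_conf(evidence):
    """Soft-or tuple confidence over the patterns that matched it:
    conf(t) = 1 - prod_i (1 - conf(p_i) * match(p_i, t)).
    evidence is a list of (pattern confidence, match degree) pairs."""
    prod = 1.0
    for conf_p, match in evidence:
        prod *= 1.0 - conf_p * match
    return 1.0 - prod
```

Two independent patterns of confidence 0.5 that both match a tuple give it confidence 1 − 0.5 · 0.5 = 0.75: redundant evidence raises confidence without ever reaching 1.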
KnowItAll and C-PANKOW

A "propose-validate" approach:
• Using existing patterns, generate queries
• For each web page w returned:
  • Extract potential fact e and assign a confidence score
  • Add the fact to the database if it has a high enough score

Patterns use chunk info
Exploiting answer types with PMI

From two-word queries to two text boxes (answer type; keywords to match):
• author; "Harry Potter"
• person; "Eiffel Tower"
• director; Swades movie
• city; India Pakistan cricket

Keywords → search engine → snippets. Every token/chunk in a snippet is a candidate
• Elimination hacks that we won't discuss

Fire Hearst-pattern queries between the desired answer type and each candidate token/chunk
Information carnivores at work

Snippet: KO :: India Pakistan Cricket Series. "A web site by Khalid Omar, sort of live from Karachi, Pakistan."

Probes fired as "cities such as [probe]", "[probe] and other cities", "[probe] is a city", etc.:

Probe      Word hits   Phrase hits
Khalid     1.3M        0
Omar       6.63M       0
sort       130M        0
Karachi    2.51M       629
Pakistan   50.5M       1

Failure modes: "Garth Brooks is a country" [singer], "gift such as wall" [clock], "person like Paris" [Hilton], "researchers like Michael Jordan" (which one?)
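A PMI-IR-style validation score in the spirit of the table above: normalize the pattern-phrase hit count by the probe's own hit count, so frequent but unrelated tokens (Khalid, sort) score near zero. A minimal sketch; the function name is mine and the counts are copied from the slide:

```python
def pmi_score(phrase_hits, probe_hits):
    """Ratio of 'cities such as <probe>' phrase hits to the probe's
    own Web hit count; 0 if the probe was never seen."""
    return phrase_hits / probe_hits if probe_hits else 0.0

# (word hits, phrase hits) from the slide's table
hits = {"Khalid": (1_300_000, 0), "Karachi": (2_510_000, 629)}
scores = {w: pmi_score(p, n) for w, (n, p) in hits.items()}
```

Karachi scores far above Khalid despite having fewer raw Web hits, so it survives as a "city" candidate.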
Sample output

author; "Harry Potter"
• J K Rowling, Ron

person; "Eiffel Tower"
• Gustave, (Eiffel), Paris

director; Swades movie
• Ashutosh Gowariker, Ashutosh Gowarikar

The errors show ambiguity and extremely skewed Web popularity.

What can search engines do to help?
• Cluster mentions and assign IDs
• Allow queries for IDs: expensive!
• "Harry Potter" context in "Ron is an author"
[The system architecture slide is shown again.]
Answer type (atype) prediction

A standard sub-problem in question answering. Increasingly important (but more difficult) for grammar-free Web queries (Broder 2002).

Current approaches:
• Pattern matching, e.g. head of noun phrase adjacent to what or which; map when, who, where directly to classes time, person, place
• Coupled perceptrons (Li and Roth 2002)
• Linear SVM on bag-of-2grams (Hacioglu 2002)
• SVM with tree kernel on parse (Zhang and Lee 2004): slim gains

Surely a parse tree holds more usable info
Informer span

A short, contiguous span of question tokens reveals the anticipated answer type (atype). Except in multi-function questions, one informer span is dominant and sufficient:
• What is the weight of a rhino?
• How much does a rhino weigh?
• How much does a rhino cost?
• Who is the CEO of IBM?

Question → parse → informer span tagger. Learn the atype label from informer + question.
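The pattern-matching baseline from the previous slide can be sketched in a few lines (the function and the exact fallback are illustrative; the learned CRF+SVM approach replaces this):

```python
# Map some wh-words directly to atype classes; otherwise fall back on
# the head noun of the informer span (here passed in, not computed).
WH_MAP = {"when": "time", "who": "person", "where": "place"}

def guess_atype(question, informer_head=None):
    """Heuristic atype baseline: wh-word lookup, else informer head."""
    wh = question.split()[0].lower()
    if wh in WH_MAP:
        return WH_MAP[wh]
    return informer_head
```

"Who is the CEO of IBM?" maps directly to person, while "What is the capital city of Japan" needs the informer span ("capital city") to yield city.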
Example

A pre-in-post Markov process produces the question. Train a CRF with features derived from the parse tree:
• POS, attachments to neighboring chunks, multiple levels
• First noun chunk? Adjacent to second verb?

[Parse tree of "What is the capital city of Japan": POS tags WP VBZ DT NN NN IN NNP; constituents WHNP, NP, PP, VP, SQ, SBARQ across several levels. The token sequence segments as (start) | What, is, the | capital, city | of, Japan, i.e. pre, informer, post states.]
Atype guessing accuracy

wh-word    #Questions   2gram SVM   2gram+Perfect   2gram+Heuristic   2gram+CRF
what          349         73.6          85.1             79.1            83.1
which          11         81.8          90.9             54.5            81.8
when           28        100           100              100             100
where          27         92.6          88.9             92.6            88.9
who            47         97.9         100              100              97.9
how*           32         87.5          87.5             90.6            90.6
rest            6         66.7         100               66.7            66.7
Aggregate                 79.4          88               82.6            86.2

[Pipeline: question → trained CRF → filter → informer feature generator + ordinary feature generator → merge → feature vector → linear SVM → atype.]
[The system architecture slide is shown again.]
Scoring function for typed search

An instance of the atype "near" keyword matches:
• IR systems: "hard" proximity predicates
• Search engines: unknown reward for proximity
• XML+IR, XRank: "hard" word containment in subtree

Example. Question: Who invented the television? Atype: person#n#1. Selectors: invent*, television.
Snippet: "television was invented in 1925. Inventor John Baird was born …". The candidate "John Baird" IS-A person#n#1, but it is not the token closest to the selector matches.

score(a) = Σ_i energy(s_i) · decay(gap(s_i, a))

summed over selector matches s_i up to some maximum window.
Learning a scoring function

Specializing the energy to IDF and keeping the best occurrence of each selector s:

score(a) = Σ_s IDF(s) · max_i decay(gap(i, a)), the max taken over occurrences i of s

Assume a parametric form for a ranking classifier
• Form of IDF, window size w, decay d
• Can also choose among decay function forms

Question-answer pairs give partial orders (Joachims 2004). Metrics: recall in the top k=50 and mean reciprocal rank (MRR).

Varying decay d (w=40, k=50):
d     Recall   MRR
0.7   0.7      0.37
0.8   0.77     0.4
0.9   0.87     0.44

Varying window w (d=0.9, k=50):
w     Recall   MRR
20    0.83     0.4
30    0.87     0.42
40    0.87     0.44

Varying IDF form:
IDF   Recall   MRR
lin   0.68     0.32
log   0.81     0.4
Indexing issues

Standard IR posting: word → {(doc, offsets)}
• word1 near word2 is standard
• instance-of(atype) near {word1, word2, …} is not

WordNet has 80,000 atype nodes, 17,000 internal, depth > 10
• "horse" is also indexed as mammal, animal, sports equipment, chess piece, …
• Original corpus 4GB, gzipped corpus 1.3GB, IR index 0.9GB, full atype index 4.3GB

XML structure indices are not designed for fine-grain, word-as-element-node use
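The index blowup comes from posting each token under every ancestor in the is-a hierarchy; a minimal sketch with a toy is-a map standing in for WordNet's 80,000-node graph (the structure and names here are illustrative):

```python
from collections import defaultdict

# Toy is-a edges; WordNet's real hierarchy is a DAG of depth > 10.
ISA = {"horse": ["mammal"], "mammal": ["animal"], "animal": []}

def build_atype_index(docs):
    """Build atype -> [(doc_id, offset)] postings by walking each
    token's hypernym chain, so 'horse' is posted under mammal and
    animal too. docs maps doc_id -> token list."""
    postings = defaultdict(list)
    for doc_id, tokens in docs.items():
        for off, tok in enumerate(tokens):
            frontier = list(ISA.get(tok, []))
            seen = set()
            while frontier:
                t = frontier.pop()
                if t in seen:
                    continue
                seen.add(t)
                postings[t].append((doc_id, off))
                frontier.extend(ISA.get(t, []))
    return postings
```

One occurrence of "horse" generates a posting for every hypernym, which is why the full atype index (4.3GB) dwarfs the plain IR index (0.9GB).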
Exploit skew in query atypes?

Atype           #Queries      Atype          #Queries
integer#n#1       100         author#n#1        7
location#n#1       78         state#n#2         6
person#n#1         77         number#n#1        6
city#n#1           20         date#n#1          6
name#n#1           10         actor#n#1         6
company#n#1         7         movie#n#1         5

Index only a small registered set of atypes R. Relax a query atype a to a generalization g in R. Test each response for reachability and retain/discard. How to pick R? What is a good objective?
• The relaxed query and the discarding steps cost extra time
• Rare atypes appear in what, which, and name questions: a long-tailed distribution
Approximate objective and approach

Index space ≈ Σ_{a∈R} corpusCount(a)

Expected query time bloat ≈ Σ_{a∈T} queryProb(a) · min_{g∈R, g⊒a} corpusCount(g) / corpusCount(a)

Minimize approximate index space with an upper bound on bloat (hard, as expected).

Sparseness: queryProb(a) is observed to be zero for most atypes a in a large taxonomy. Smooth using similarity between atypes.
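The expected-bloat objective can be evaluated directly for a candidate registered set R; a minimal sketch (function and argument names are mine; generalizations[a] is assumed to include a itself, since g ⊒ a):

```python
def expected_bloat(R, query_prob, corpus_count, generalizations):
    """Sum over queried atypes a of queryProb(a) times the bloat of
    relaxing a to its cheapest registered generalization g, i.e.
    corpusCount(g) / corpusCount(a)."""
    bloat = 0.0
    for a, p in query_prob.items():
        registered = [g for g in generalizations[a] if g in R]
        bloat += p * min(corpus_count[g] / corpus_count[a] for g in registered)
    return bloat
```

For example, if city#n#1 is always queried but only location#n#1 (10x more corpus instances) is registered, every query pays a 10x bloat; registering city#n#1 itself brings the bloat back to 1, at the cost of extra index space.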
Sample results

The index space approximation is reasonable, and average query time bloat is modest for small index space overheads.

|R|    AveBloat
30     5.16
47     1.7
All    1

|R|    IndexSize
9      409M
13     498M
29     911M
All    4300M

[Plot residue: query run times using the original atype a vs. the relaxed generalization g.]
Summary

Entity and relation annotators
• Maturing technology
• Unlikely to be perfect for open-domain sources

The future: query paradigms that combine text and annotations
• End-user friendly selection and aggregation
• Allow uncertainty, exploit redundancy

Can we scale to terabytes of text? Will centralized search engines be feasible? How to federate annotation management?