Query and document understanding
TRANSCRIPT
![Page 1: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/1.jpg)
Rishiraj Saha Roy
Ph.D. Student under
Prof. Niloy Ganguly (IIT Kharagpur) and
Dr. Monojit Choudhury (Microsoft Research India)
Pre-FIRE Workshop on Information Retrieval Bengaluru, 15 - 17 June 2013
![Page 2: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/2.jpg)
June 17, 2013 Pre-FIRE Workshop on Information Retrieval 2
![Page 3: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/3.jpg)
What is “not” understanding?
Query: compare performance shikhar dhawan rohit sharma
Document: Shikhar Dhawan has much better shot placement
than Rohit Sharma.
compare performance shikhar dhawan rohit sharma
has than better shot shikhar dhawan rohit placement much sharma
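To make the point concrete, here is a minimal Python sketch of what a pure bag-of-words view sees for this query and document. The tokenisation (whitespace split, lowercasing, stripping edge punctuation) is an assumption for illustration, not something specified on the slide.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase, strip punctuation at word edges, and count tokens."""
    tokens = (tok.strip(".,!?").lower() for tok in text.split())
    return Counter(tok for tok in tokens if tok)

query = "compare performance shikhar dhawan rohit sharma"
document = "Shikhar Dhawan has much better shot placement than Rohit Sharma."
shuffled = "has than better shot shikhar dhawan rohit placement much sharma"

q_bag = bag_of_words(query)
d_bag = bag_of_words(document)

# Word order is lost: the shuffled text has exactly the same bag as the document.
print(bag_of_words(shuffled) == d_bag)  # True

# Only the name tokens overlap; nothing records who is better than whom.
shared = sorted(set(q_bag) & set(d_bag))
print(shared)  # ['dhawan', 'rohit', 'sharma', 'shikhar']
```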
![Page 4: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/4.jpg)
Much more to queries and documents than
keywords and their frequencies!!!
![Page 5: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/5.jpg)
Query: create hyperlinks in excel
Forums
create hyperlinks in word …. Filters in excel have to be
specified with…
Spam (?)
Zingo.com – Your one stop tech guide. Best excel tips | Best
hyperlinks in your page | Create your own blog today
![Page 6: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/6.jpg)
Query 1: us open home page
Query 2: chrome cant open home page
US open official site by IBM. Cant view page
properly? Best viewed in Google Chrome.
![Page 7: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/7.jpg)
Relative word orders important
china detains india traders latest news
Query segmentation
glass office windows open office windows
Entities, Attributes and Relations
france capital, polio symptoms, bon jovi age
barclays capital
capital punishment
![Page 8: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/8.jpg)
And much more!!!
Term proximities
Term dependencies
Term and page annotations
…
Endless research areas………..
![Page 9: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/9.jpg)
The mean length of (distinct) Web search queries is increasing
Chart: mean query length in words (y-axis 0 to 5) by year: 2000: 2.21, 2006: 3.5, 2010: 3.98
> 8 words Long Queries (3.2%)
3 to 8 words Medium Queries (80%)
< 3 words Short Queries (14%)
![Page 10: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/10.jpg)
Query understanding: Why? How?
Queries do not follow any formal grammar
“EMERGENCY HATCH PENGUIN EGGS HOW”
medicines for high pressure otc only
samsung galaxy gprs config at&t
![Page 11: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/11.jpg)
Reordering, function words, multiword expressions, partly natural language (NL)
Natural language processing (NLP) / Linguistics-based
techniques fail!
Computationally expensive!
Simple data-driven statistical approaches
Empirical formulations
Provide noticeable improvements!!
![Page 12: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/12.jpg)
Query segmentation
Why?
A simple how
Extracting Entities and Attributes
Why?
Some simple hows
![Page 13: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/13.jpg)
Dividing a query into individual semantic units (Bergsma and
Wang, 2007)
Example
australian open home page →
australian open | home page
australian | open home | page
![Page 14: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/14.jpg)
Goes beyond multiword named entity recognition (gprs
config, history of, how to)
Helps in better query understanding
Query expansion, query suggestions
Can improve IR performance by increasing precision
north america versus north of america
![Page 15: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/15.jpg)
Simple algorithm – Pointwise Mutual Information
$$\mathrm{PMI}(a, b) = \log_2 \frac{p(ab)}{p(a)\, p(b)}$$
Compute probabilities from any source – documents,
queries, page titles, anchor text
Microsoft Web n-gram services
http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx
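As a rough illustration of the PMI formula, the sketch below estimates the probabilities from raw unigram and bigram counts. The function name, its parameters, and the toy counts are all hypothetical; in practice the counts would come from whichever source is chosen (documents, queries, page titles, anchor text, or an n-gram service).

```python
import math

def pmi(count_ab: int, count_a: int, count_b: int,
        total_unigrams: int, total_bigrams: int) -> float:
    """PMI(a, b) = log2( p(ab) / (p(a) * p(b)) ), estimated from counts."""
    p_ab = count_ab / total_bigrams
    p_a = count_a / total_unigrams
    p_b = count_b / total_unigrams
    return math.log2(p_ab / (p_a * p_b))

# Toy counts (made up): the pair occurs 16 times more often than chance predicts.
print(pmi(count_ab=8, count_a=16, count_b=32,
          total_unigrams=1024, total_bigrams=1024))  # 4.0

# If the pair occurs exactly as often as independence predicts, PMI is 0.
print(pmi(count_ab=1, count_a=32, count_b=32,
          total_unigrams=1024, total_bigrams=1024))  # 0.0
```

A zero bigram count would make the logarithm undefined, so a real implementation would need smoothing or a floor value; that choice is left open here.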
![Page 16: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/16.jpg)
PMI measures strength of bonding – by chance or by choice?
Meaningful bigrams have high PMI – harry potter, blood
pressure, jurassic park, difference between
Measure PMI of adjacent word pairs
Fix significance threshold
Insert boundary whenever PMI falls below threshold
![Page 17: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/17.jpg)
Input: australian open home page
PMI(australian, open) = 15.89
PMI(open, home) = 5.43
PMI(home, page) = 13.92
Threshold: 8.50
Output: australian open | home page
Problem: Not optimized over whole query!!
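The worked example can be reproduced with a short Python sketch. The `segment` function is a hypothetical helper written for illustration; it takes the adjacent-pair PMI values as given (here, the numbers from the slide) and inserts a boundary wherever the score falls below the threshold.

```python
def segment(words, pmi_scores, threshold):
    """Split words into segments; pmi_scores[i] is the PMI of (words[i], words[i+1])."""
    if not words:
        return ""
    if len(pmi_scores) != len(words) - 1:
        raise ValueError("need exactly one PMI score per adjacent word pair")
    segments = [[words[0]]]
    for word, score in zip(words[1:], pmi_scores):
        if score < threshold:
            segments.append([word])      # weak bond: start a new segment
        else:
            segments[-1].append(word)    # strong bond: extend the current segment
    return " | ".join(" ".join(seg) for seg in segments)

words = "australian open home page".split()
scores = [15.89, 5.43, 13.92]  # PMI values taken from the slide
print(segment(words, scores, threshold=8.50))  # australian open | home page
```

Each boundary decision looks only at one adjacent pair, which is exactly the limitation noted above: the segmentation is not optimized over the whole query.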
![Page 18: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/18.jpg)
jetbeam rrt-01
Where to buy? How to use? Life? Weight? ….
roger federer
Return information in structured form
lotr cast
Book? Movie? Game?
![Page 19: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/19.jpg)
Simplest – List based approach
Wikipedia titles:
http://dumps.wikimedia.org/enwiki/latest/
5 million entries, 2 GB RAM, no problem
![Page 20: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/20.jpg)
Efficient data structures – Trie, Dictionary
Low memory
Fast search
Lists work great, extensive commercial use
Annotate both queries and documents
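A list-based annotator over a trie can be sketched as follows. This is a minimal illustration, assuming a token-level trie built from nested dictionaries and a greedy longest-match policy; the three titles are hypothetical stand-ins for a real title list, and the `END` marker assumes no token is literally `$end`.

```python
END = "$end"  # marker key for "an entity ends here"; assumes no token equals it

def build_trie(entities):
    """Build a token-level trie (nested dicts) from a list of entity names."""
    root = {}
    for entity in entities:
        node = root
        for token in entity.lower().split():
            node = node.setdefault(token, {})
        node[END] = True
    return root

def annotate(text, trie):
    """Return (start, end, entity) token spans using greedy longest match."""
    tokens = text.lower().split()
    spans, i = [], 0
    while i < len(tokens):
        node, j, best_end = trie, i, None
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                best_end = j             # remember the longest entity seen so far
        if best_end is None:
            i += 1                       # no entity starts at this token
        else:
            spans.append((i, best_end, " ".join(tokens[i:best_end])))
            i = best_end                 # continue after the matched entity
    return spans

titles = ["Roger Federer", "Howard Shore", "Music director"]  # hypothetical list
trie = build_trie(titles)
print(annotate("howard shore music director", trie))
# [(0, 2, 'howard shore'), (2, 4, 'music director')]
```

The same `annotate` call works on document text as well as queries, since both are just token sequences.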
![Page 21: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/21.jpg)
howard shore music director
![Page 22: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/22.jpg)
Often need to view very large files – lists, logs
LTF Viewer – an unsung hero
http://www.swiftgear.com/ltfviewer/features.html
Vim, Cygwin, command-based
Edit programmatically only
![Page 23: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/23.jpg)
More than one match
the dark knight, the dark knight rises
tom cruise ship scene
False positives – Match, but not entity
list of capitals
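These cases are easy to reproduce with a naive matcher that reports every n-gram found in the list. The sketch below is illustrative only; the entity set and the `max_len` cap are assumptions chosen to mirror the examples on this slide.

```python
def all_matches(text, entity_set, max_len=4):
    """Return every (start, end, phrase) token span whose phrase is in entity_set."""
    tokens = text.lower().split()
    matches = []
    for start in range(len(tokens)):
        last = min(start + max_len, len(tokens))
        for end in range(start + 1, last + 1):
            candidate = " ".join(tokens[start:end])
            if candidate in entity_set:
                matches.append((start, end, candidate))
    return matches

# Hypothetical entity list containing the slide's examples.
entities = {"tom cruise", "cruise ship", "the dark knight",
            "the dark knight rises", "list of capitals"}

# Nested matches: the shorter title is a prefix of the longer one.
print(all_matches("the dark knight rises", entities))
# [(0, 3, 'the dark knight'), (0, 4, 'the dark knight rises')]

# Overlapping matches: "cruise" is claimed by two candidate entities.
print(all_matches("tom cruise ship scene", entities))
# [(0, 2, 'tom cruise'), (1, 3, 'cruise ship')]
```

Choosing among such candidates (longest match, leftmost match, or a statistical score) is a separate decision that list lookup alone does not make, and a string such as "list of capitals" matches the list without being an entity at all.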
![Page 24: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/24.jpg)
Why?
User wants specific results
galaxy note specs
Intent diversification
galaxy note (What about it??)
Pictures, specs, stores, prices, accessories
![Page 25: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/25.jpg)
Using documents: Template based
What is the A of I <what … A … I>
I’s A
Who was A of I <who … A … I>
A of I
A in I
![Page 26: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/26.jpg)
Ps2’s accessories
Accessories of galaxy note
New Delhi is the capital of India
Paris is the capital of France
Manmohan Singh is the prime minister of India
??? is the prime minister of Pakistan
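The templates can be approximated with a few regular expressions. This is a rough sketch under stated assumptions: the four patterns and their ordering are my own rendering of the templates (A = attribute, I = instance), they operate on short lowercased strings, and no claim is made that this matches the exact rules used in the talk.

```python
import re

# Hypothetical renderings of the templates, tried in order.
TEMPLATES = [
    re.compile(r"(?:what|who) (?:is|was) (?:the )?(?P<attr>.+?) of (?P<inst>.+?)\??"),
    re.compile(r".+? is the (?P<attr>.+?) of (?P<inst>.+?)\.?"),
    re.compile(r"(?P<inst>.+?)'s (?P<attr>.+)"),           # I's A
    re.compile(r"(?P<attr>.+?) (?:of|in) (?P<inst>.+)"),   # A of I, A in I
]

def extract(text):
    """Return (instance, attribute) from the first matching template, else None."""
    normalised = text.lower().replace("\u2019", "'").strip()
    for template in TEMPLATES:
        match = template.fullmatch(normalised)
        if match:
            return match.group("inst"), match.group("attr")
    return None

print(extract("Accessories of galaxy note"))
# ('galaxy note', 'accessories')
print(extract("Manmohan Singh is the prime minister of India"))
# ('india', 'prime minister')
print(extract("What is the capital of India?"))
# ('india', 'capital')
```

Because the patterns are purely surface-level, they will also fire on any phrase that merely has the same shape, so the extracted pairs need further filtering.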
![Page 27: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/27.jpg)
Challenge
Hall of fame
Wall of shame
Schindler’s list
Beijing’s mist
![Page 28: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/28.jpg)
Using query logs or documents – Co-occurrence
counts
Common wisdom: Attributes are frequent words
More robust statistics: They co-occur with a higher
number of distinct words
![Page 29: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/29.jpg)
nikon camera prices, winter coats prices, property
prices in bengaluru, microsoft share prices
nikon camera prices, nikon camera models, nikon
camera for sale, nikon camera lens
Issues: Where to draw the line?
lyrics, recipe, cast
after, test, centre, black, server
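The "distinct co-occurring words" statistic can be sketched in a few lines of Python. The query list below is just the set of examples from this slide treated as a tiny query log, and `distinct_cooccurrence` is a hypothetical helper name; a real log would be far larger and would need stop-word handling.

```python
from collections import defaultdict

def distinct_cooccurrence(queries):
    """For each word, count the distinct other words it appears with in any query."""
    neighbours = defaultdict(set)
    for query in queries:
        words = set(query.lower().split())
        for word in words:
            neighbours[word] |= words - {word}
    return {word: len(others) for word, others in neighbours.items()}

queries = [
    "nikon camera prices", "winter coats prices",
    "property prices in bengaluru", "microsoft share prices",
    "nikon camera models", "nikon camera for sale", "nikon camera lens",
]

counts = distinct_cooccurrence(queries)
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(ranked[:3])  # [('prices', 9), ('camera', 6), ('nikon', 6)]
```

Even on this toy log the attribute "prices" ranks first, but the entity words "nikon" and "camera" follow closely, which illustrates why deciding where to draw the line is hard.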
![Page 30: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/30.jpg)
Keyword-based retrieval good, but not enough
Query and document understanding are required to boost IR
performance
Methods used need to be fast and scalable
Query segmentation is a first step towards better query
representation
Entities and attributes can be identified effectively using simple
approaches
![Page 31: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/31.jpg)
http://bit.ly/19b2dMC
![Page 32: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/32.jpg)
![Page 33: Query and document understanding](https://reader030.vdocument.in/reader030/viewer/2022032505/55c4ffbbbb61ebcc6c8b4603/html5/thumbnails/33.jpg)