Beauty of IR
DESCRIPTION
Information Retrieval is about how we can search for and retrieve things. In this talk, we look at the various components that make up a typical search engine and discuss the associated challenges.
TRANSCRIPT
![Page 1: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/1.jpg)
Beauty of IR
Venkatesh Vinayakarao
An IR enthusiast!
![Page 2: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/2.jpg)
Venkatesh Vinayakarao 2
Disclaimer
Most examples and discussions in this talk revolve around well-known search engines. This is only to make the learning concrete. Please keep in mind that IR goes beyond search engines.
25+ slides of interesting discussion ahead…
2/2014
![Page 3: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/3.jpg)
Quiz
1. Explain any two challenges in Query Intent Understanding using some examples, and discuss why it is a hard problem.
2. How are “Tiles”, as discussed in the class, used in search engines? What purpose do they serve?
3. Search engines have no UI-related design concerns. True/False?
![Page 4: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/4.jpg)
About Me
BE Computer Science (Y2K)
MS (IT)
IT Service Industry
Start Up
Nokia
Yahoo
Microsoft (Bing)
PhD
Let me learn everything all over again!
![Page 5: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/5.jpg)
Our Agenda: The Beauty of IR!
Crawling, Content Processing, Indexing
Me!
Query (Intent) Understanding, Ranking, User Interface
Offline Horror!
Online Terror!
How to process Korean queries for local listings?
![Page 6: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/6.jpg)
Crawling
How frequently should we crawl? Fresh & Super-Fresh!
How to crawl cricket scores? Are we even crawling here?
How to avoid 404 - Page not found?
How much time did it take Google to show your first personal page?
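One common way to approach the "how frequently should we crawl?" question is to adapt the revisit interval per page: shorten it when the page changed since the last visit, lengthen it when it did not. The sketch below is only an illustration of that heuristic, not the method used by any particular search engine; `CrawlState`, `next_interval`, and the interval bounds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Assumed bounds: a "super-fresh" floor of one minute and a weekly ceiling.
MIN_INTERVAL_S = 60.0
MAX_INTERVAL_S = 7 * 24 * 3600.0


@dataclass
class CrawlState:
    """Per-URL crawl bookkeeping (hypothetical structure)."""
    url: str
    interval_s: float
    last_hash: str = ""


def next_interval(state: CrawlState, content_hash: str) -> float:
    """Halve the revisit interval if the content changed, double it otherwise."""
    changed = content_hash != state.last_hash
    factor = 0.5 if changed else 2.0
    # Clamp so that hot pages are not hammered and cold pages are not forgotten.
    state.interval_s = min(MAX_INTERVAL_S, max(MIN_INTERVAL_S, state.interval_s * factor))
    state.last_hash = content_hash
    return state.interval_s
```

A live cricket score page would quickly settle at the floor under this scheme, which hints at why such content is often better served by a feed than by crawling.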
![Page 7: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/7.jpg)
Content Processing
Good Read: https://getlisted.org/static/resources/local-search-data-providers.html
![Page 8: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/8.jpg)
Content Processing
Query: “Schools in Delhi” Answer: “Delhi Public School” Good or Bad?
Query: “Schools in Hyderabad” Answer: “Delhi Public School” Good or Bad?
Query: “Hotels in Bombay” Answer: “Grand Hyatt, Mumbai” Good or Bad? How do we get the same results for both Mumbai and Bombay?
Query: “Maruti Car service in delhi” Answer: “Rana Motors Private Limited”. What happened?
![Page 9: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/9.jpg)
Content Processing & Indexing
A real example: http://www.yelp.com/dataset_challenge/
Enriched Business
• Category synonyms (e.g., auto service and car service are interchangeable at times)
• Users' query forms (e.g., McDonald's is commonly queried as McD)
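The enrichment ideas on this slide (category synonyms, common query forms, and the earlier Bombay/Mumbai question) can be sketched as a normalization step applied to both the indexed text and the incoming query. This is a minimal illustration under assumptions: the `ALIASES` table and the `normalize` function are hypothetical, and a real system would derive such mappings from data rather than hard-code them.

```python
import re

# Hypothetical alias table: each variant maps to one canonical form.
ALIASES = {
    "bombay": "mumbai",
    "mcd": "mcdonalds",
    "car service": "auto service",
}


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and rewrite known aliases to canonical forms."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    text = re.sub(r"\s+", " ", text).strip()
    # Longer aliases first, so multi-word phrases are matched before single words.
    for alias in sorted(ALIASES, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(alias)}\b", ALIASES[alias], text)
    return text


print(normalize("Hotels in Bombay"))
print(normalize("Maruti Car service in delhi"))
```

Because the same function runs at index time and at query time, "Hotels in Bombay" and "Hotels in Mumbai" end up as the same string and therefore match the same listings.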
![Page 10: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/10.jpg)
Derived Values & Indexing
Given a location, how will you find all businesses within a 1 km radius?
Query: schools near govindpuri delhi
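One way to answer the radius question is to store a derived value with each business, here a coarse grid cell computed from its coordinates, and to look only at the cells that overlap the search circle before applying an exact distance check. The sketch below assumes a fixed cell size (`CELL_DEG`) and uses made-up business names and coordinates; the bounding box is an approximation that is reasonable for small radii away from the poles and the antimeridian.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0
CELL_DEG = 0.01  # assumed grid cell size in degrees (roughly 1 km of latitude)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def cell_of(lat, lon):
    """Derived value stored in the index: the grid cell containing the point."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def build_index(businesses):
    """Map each grid cell to the businesses located in it."""
    index = defaultdict(list)
    for name, lat, lon in businesses:
        index[cell_of(lat, lon)].append((name, lat, lon))
    return index


def within_radius(index, lat, lon, radius_km=1.0):
    """Return business names within radius_km, nearest first."""
    km_per_deg_lat = 2 * math.pi * EARTH_RADIUS_KM / 360
    dlat = radius_km / km_per_deg_lat
    # Longitude degrees shrink with latitude; pad using the poleward edge of the box.
    dlon = dlat / max(math.cos(math.radians(abs(lat) + dlat)), 1e-6)
    lat_lo, lon_lo = cell_of(lat - dlat, lon - dlon)
    lat_hi, lon_hi = cell_of(lat + dlat, lon + dlon)
    hits = []
    for ci in range(lat_lo, lat_hi + 1):
        for cj in range(lon_lo, lon_hi + 1):
            for name, blat, blon in index.get((ci, cj), []):
                d = haversine_km(lat, lon, blat, blon)
                if d <= radius_km:  # exact check removes box-only candidates
                    hits.append((d, name))
    return [name for _, name in sorted(hits)]


# Made-up listings around an assumed coordinate for Govindpuri, Delhi.
listings = [
    ("School A", 28.5360, 77.2655),
    ("School B", 28.5440, 77.2649),
    ("School C", 28.5700, 77.2649),
    ("School D", 28.5355, 77.2760),
]
index = build_index(listings)
print(within_radius(index, 28.5355, 77.2649))
```

The grid keeps the candidate set small; production systems often use geohashes, quadtrees, or R-trees for the same purpose.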
![Page 11: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/11.jpg)
Query Understanding Challenge
Need a team of 3 people and one laptop.
Volunteers?
![Page 12: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/12.jpg)
Rules
I will give an entity name. You will have to frame at least three different (dissimilar) queries (and as many as you can) that return the same document as the correct result in first place.
At the end, you should submit: each query, and the maximum number n of top results that you kept the same across queries. You will have 5 minutes.
![Page 13: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/13.jpg)
Questions
Tom Cruise
Aishwarya Rai
Tom Hanks
Srikanta Bedathur
Venkatesh Vinayakarao
Pankaj Jalote
Amir Khan
Andre Agassi
Manmohan Singh
![Page 14: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/14.jpg)
Query Understanding
Query: Michael Jordan
Which MJ to return? The basketball player or the actor?
Factors:
User profile
Query context (session details, browser data, links, etc.)
…
Query: Delhi School
What does the user want? “Delhi Public School” or “Schools in Delhi” or “some Indian school in US”?
Query: “IR”
Predict the top three results
![Page 15: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/15.jpg)
Ok! I give up!!
A frustrated search user: “please show me some t-shirt brands”
![Page 16: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/16.jpg)
More fun with auto completion
![Page 17: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/17.jpg)
System Overview (Simplified)
Front-end, Front-end, Front-end, Front-end
Query Understanding, Query Classifiers
Web Answer, Local Answer, Finance Answer, Tech Answer & many more
KB
Index Serve, Crawled Content
Crawler
Web
Expanded Query
User Query
![Page 18: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/18.jpg)
Ranking & Relevance
How do we know if a document is relevant (in the web search context)?
Popularity of URL
Domain score (is it ac.in or .edu?)
TF, IDF
Entity, chain entity?
Trust factor (Wikipedia?)
Inlinks/outlinks
Position of query terms
Sequence of query terms
… and 1000s of such things
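Of the signals listed, TF and IDF are the easiest to show in a few lines. The sketch below scores documents against a query using one simple TF-IDF variant (length-normalized term frequency times the log of inverse document frequency); the function name and the sample documents are made up for illustration, and real rankers combine this with many other features.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document as the sum of tf * idf over the query terms."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                score += (tf[term] / len(tokens)) * math.log(n / df[term])
        scores.append(score)
    return scores


docs = [
    "delhi public school hyderabad",
    "schools in delhi list of schools",
    "coffee shops near iiitd delhi",
]
print(tf_idf_scores("schools delhi", docs))
```

Here "delhi" occurs in every document, so its IDF is zero and it contributes nothing; only "schools" separates the documents.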
![Page 19: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/19.jpg)
Are current search engines good at relevance & ranking?
Bing, Google
Query1: Vegetarian hotels in south delhi
Query2: South Indian hotels in south delhi
![Page 20: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/20.jpg)
…More examples
Query3: South Indian restaurants in south delhi
What’s the difference between query2 and query3? Should search engines give different results?
![Page 21: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/21.jpg)
How far for a coffee?
Google: Just one word (iiitd) missing. So what?
Let’s change the query to “coffee shops near iiitd delhi”.
“Coffee shops near me” gives results from Janakpuri, Gurgaon, CP & Kamla Nagar.
![Page 22: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/22.jpg)
Why is it hard?
What makes Ranking & Relevance hard?
![Page 23: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/23.jpg)
User Interface
Is UI important for a search engine?
Maps in local results
Live sport score cards
Finance tickers
Filters
Search operators
Entity infoboxes
What impact do these make?
![Page 24: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/24.jpg)
Our Agenda: The Beauty of IR!
Crawling, Content Processing, Indexing
Me!
Query (Intent) Understanding, Ranking, User Interface
Offline Horror!
Online Terror!
How to process Korean queries for local listings?
![Page 25: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/25.jpg)
Evaluation
Various evaluation methods:
Precision/Recall
Mean Average Precision
Mean Reciprocal Rank: if the first relevant doc is at the kth position, RR = 1/k.
NDCG: non-Boolean/graded relevance scores
$\mathrm{DCG} = r_1 + \frac{r_2}{\log_2 2} + \frac{r_3}{\log_2 3} + \dots + \frac{r_n}{\log_2 n}$
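The Boolean-relevance measures named on this slide can be sketched for a single query as follows. The function names and the sample ranking are illustrative assumptions; Mean Average Precision and Mean Reciprocal Rank are obtained by averaging the per-query values over a set of queries.

```python
def precision_recall(ranked, relevant):
    """Precision and recall of a returned list against the set of relevant docs."""
    hits = sum(1 for d in ranked if d in relevant)
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant doc appears."""
    hits, total = 0, 0.0
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """RR = 1/k, where k is the position of the first relevant doc (0 if none)."""
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / k
    return 0.0


ranked = ["d3", "d1", "d4", "d2"]   # made-up system output
relevant = {"d4", "d2"}             # made-up judgments
print(precision_recall(ranked, relevant))
print(average_precision(ranked, relevant))
print(reciprocal_rank(ranked, relevant))
```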
![Page 26: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/26.jpg)
NDCG - Example
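4 documents: d1, d2, d3, d4

| i | Ground Truth: Document Order | r_i | Ranking Function 1: Document Order | r_i | Ranking Function 2: Document Order | r_i |
|---|---|---|---|---|---|---|
| 1 | d4 | 2 | d3 | 2 | d3 | 2 |
| 2 | d3 | 2 | d4 | 2 | d2 | 1 |
| 3 | d2 | 1 | d2 | 1 | d4 | 2 |
| 4 | d1 | 0 | d1 | 0 | d1 | 0 |

NDCG_GT = 1.00, NDCG_RF1 = 1.00, NDCG_RF2 = 0.9203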
Taken from http://www.stanford.edu/class/cs276/handouts/EvaluationNew.ppt
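The numbers in this example can be reproduced with the DCG definition from the previous slide, where the first gain is undiscounted and the gain at position i > 1 is divided by log2(i). The sketch below is a small check under that definition; the function names are illustrative, and the ideal ordering is obtained by sorting the same gains, which works here because every ranking contains the same four documents.

```python
import math


def dcg(gains):
    """DCG = r1 + r2/log2(2) + r3/log2(3) + ... + rn/log2(n)."""
    total = 0.0
    for i, r in enumerate(gains, start=1):
        total += r if i == 1 else r / math.log2(i)
    return total


def ndcg(gains):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


ground_truth = [2, 2, 1, 0]   # d4, d3, d2, d1
rf1 = [2, 2, 1, 0]            # d3, d4, d2, d1
rf2 = [2, 1, 2, 0]            # d3, d2, d4, d1

print(round(ndcg(ground_truth), 4), round(ndcg(rf1), 4), round(ndcg(rf2), 4))
```

Ranking Function 1 swaps two documents with equal gain, so its NDCG stays at 1; Ranking Function 2 pushes a gain-2 document below a gain-1 document and is penalized.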
![Page 27: Beauty ofir](https://reader033.vdocument.in/reader033/viewer/2022061218/548270b0b4af9f9b0d8b47fb/html5/thumbnails/27.jpg)
Are we done?
Q & A