Challenges in Professional Search, Evangelos Kanoulas

Post on 12-Jul-2020

Page 1: Challenges in Professional Search - Hogeschool Leiden

Challenges in Professional Search

Evangelos Kanoulas

Page 2

Who am I?

• Assistant Professor at Institute of Informatics (UvA)

• Director of Data Science MSc program (UvA, VU, ADS)

• Before that:

– Google Research & University of Sheffield

• My background:

– Computer Science (PhD and MSc, Northeastern Univ.)

– Joint degree on Informatics & Economics (BS)

• My expertise: 

– Information Retrieval, Text Mining, and Natural Language Understanding

Page 3

Professional Search

“… employees spend 1.8 hours every day, 9.3 hours per week on average, searching and gathering information.” – source: McKinsey

“the knowledge worker spends about 2.5 hours per day, or roughly 30% of the workday, searching for information” – source: IDC

Page 4

Web Search Engines

Great at answering simple user questions

Page 5

Web Search

• Find one (or a few) good web‐pages

• High redundancy in information on the web

• High redundancy in user signals

– E.g. clicks on documents, query re‐writes

Page 6

Professional Search

• (Often) exploratory search

• Users do not know exactly what they are looking for or …

• … how to phrase their request (query)

Page 7

Professional Search

• Total‐recall search

• Users need to find (nearly) everything about a topic X

• Exhaustive research: X = me, my PhD topic, Ebola

• Investigation: X = somebody, something, or some activity

• Systematic review: X = studies measuring a particular effect

• Patent search: X = prior art

Page 8

Professional Search

• There is no (simple) single query

A sample MEDLINE query

1. exp vitamin A/
2. vitamin A.mp
3. retinol.mp
4. exp dietary supplements/
5. or/1-4
6. exp pneumonia/
7. pneumonia$.mp
8. exp pneumonia, bacterial/
9. exp pneumonia, lipid/
10. exp pneumonia, mycoplasma/
...
14. exp pneumonia, viral/
15. exp respiratory tract infections/
16. acute adj respiratory.mp
17. respiratory adj infection.mp
18. respiratory adj disease.mp
19. or/6-18
20. 5 and 19

Main Question: Is adjunctive vitamin A effective in children diagnosed with non‐measles pneumonia?
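The numbered lines of such a query combine earlier lines: `or/1-4` is the union of the results of lines 1 to 4, and `5 and 19` is the intersection of two earlier lines. A minimal sketch of that composition follows; the record IDs and the helper names `or_range` and `and_lines` are made up for illustration, and this is not how a MEDLINE interface is implemented.

```python
# Toy illustration of how numbered search lines combine.
# All record IDs below are hypothetical.

def or_range(results, first, last):
    """Union of the result sets of lines first..last (like `or/1-4`)."""
    combined = set()
    for line_no in range(first, last + 1):
        # Lines without a result set in this toy example count as empty.
        combined |= results.get(line_no, set())
    return combined

def and_lines(results, a, b):
    """Intersection of two earlier lines (like `5 and 19`)."""
    return results[a] & results[b]

# Hypothetical result sets for a few of the lines.
results = {
    1: {101, 102},   # exp vitamin A/
    2: {102, 103},   # vitamin A.mp
    3: {104},        # retinol.mp
    4: {105, 201},   # exp dietary supplements/
    6: {201, 202},   # exp pneumonia/
    7: {102, 203},   # pneumonia$.mp
}
results[5] = or_range(results, 1, 4)     # or/1-4
results[19] = or_range(results, 6, 18)   # or/6-18
results[20] = and_lines(results, 5, 19)  # 5 and 19
print(sorted(results[20]))
```

Only records that match both the vitamin A block and the pneumonia block survive the final line.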

Page 9

[Diagram: Modern Search Engines. Labels: Crawling; Pre‐processing & Indexing; Query understanding; Logging; Content; Quality; Freshness; Spamminess; Clicks; Profiles; Ranking Algorithm]

Page 10

Modern Search Engines

Page 11

Batch Learning

• Requires labeling data (query – document pairs)

• Time‐consuming, and boring

Page 12

Batch Learning

• Leads to a static, one‐size‐fits‐all search engine

Page 13

User Feedback

• Leads to a static, one‐size‐fits‐all search engine


Page 14

TREC Total Recall

Objective:

1. Find documents containing nearly all relevant information …

2. … while uncovering [relatively] few documents 
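The two objectives trade off recall against review effort. A minimal sketch of how that trade-off can be measured, assuming the full set of relevant documents is known for evaluation; the function names and document IDs are hypothetical, and this is not the official TREC scoring code.

```python
def recall_at_effort(review_order, relevant):
    """Recall after each uncovered document (a gain curve of recall versus effort)."""
    found, curve = 0, []
    for doc_id in review_order:
        found += doc_id in relevant          # True counts as 1
        curve.append(found / len(relevant))  # assumes `relevant` is non-empty
    return curve

def effort_for_recall(review_order, relevant, target=0.8):
    """Number of documents uncovered before recall first reaches `target`."""
    for effort, recall in enumerate(recall_at_effort(review_order, relevant), start=1):
        if recall >= target:
            return effort
    return None  # target never reached

# Hypothetical review order and relevance judgments.
order = ["d1", "d7", "d3", "d2", "d9"]
relevant = {"d1", "d2", "d3"}
print(recall_at_effort(order, relevant))
print(effort_for_recall(order, relevant, target=0.8))
```

A system does well when the curve climbs to high recall after few uncovered documents.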


Page 15

TREC Total Recall: Participation

[Diagram labels: query; document collection; search algorithm; results; human assessor]

Page 16

TREC Total Recall: Participation

• Play‐at‐home

– Data collection and queries available via the Internet

– Automated assessor accessed via the Internet

• Play‐in‐sandbox

– Submit a virtual appliance that works isolated from the Internet

– Downloads corpus and topic from an intranet

– “Uncover” documents one at a time via the intranet

Page 17

TREC Total Recall

Page 18

User‐in‐the‐loop strategies

• Extreme relevance feedback

• Batch learning

– Uncover a training set, then rank

• Online learning [UvA/HvA submission]

Page 19

Online Learning

• Learn‐as‐you‐go

• Requires user feedback (implicit or explicit)

• Serves the user and builds a training collection at the same time

[Diagram labels: user; query; search algorithm; documents; examine document; generates feedback; feedback]

Page 20

Online Learning

• Learn‐as‐you‐go

• Requires user feedback (implicit or explicit)

• Serves the user and builds a training collection at the same time

• The collection contains feedback (e.g. labels) only on items you show to the user

Page 21

Exploitation: Baseline Model

1. Run ad hoc search to construct a synthetic training dataset

• Unsupervised method – no training data needed

2. Train a classifier

3. Predict relevance for the remaining collection

4. Select a few highest‐scoring documents for review.

5. Review the documents, coding each as “relevant” or “not relevant.”

6. Add the documents to the training set.
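The six steps above can be sketched as a loop. This is a minimal sketch under stated assumptions, not the submitted system: the ad hoc search is plain query-term overlap, the classifier is a crude per-term log-odds model, the synthetic seed labels the top-ranked document relevant and the bottom-ranked one not relevant, and all names (`review_loop`, `assessor`, the toy collection) are hypothetical.

```python
from collections import Counter
import math

def tokenize(text):
    return text.lower().split()

def train(labeled):
    """Crude stand-in classifier: per-term log-odds with add-one smoothing."""
    pos, neg = Counter(), Counter()
    for text, is_relevant in labeled:
        (pos if is_relevant else neg).update(set(tokenize(text)))
    n_pos = sum(1 for _, r in labeled if r)
    n_neg = len(labeled) - n_pos
    return {t: math.log((pos[t] + 1) / (n_pos + 2)) - math.log((neg[t] + 1) / (n_neg + 2))
            for t in set(pos) | set(neg)}

def score(model, text):
    return sum(model.get(t, 0.0) for t in set(tokenize(text)))

def review_loop(collection, query, assessor, batch_size=1, budget=4):
    """Seed, train, predict, select, review, add; repeat until the budget is spent."""
    # Step 1: ad hoc search (query-term overlap) builds a synthetic training set.
    q = set(tokenize(query))
    overlap = {d: len(q & set(tokenize(t))) for d, t in collection.items()}
    ranked = sorted(collection, key=lambda d: (-overlap[d], d))
    synthetic = [(collection[ranked[0]], True), (collection[ranked[-1]], False)]
    reviewed = {}  # doc_id -> label given by the assessor
    while len(reviewed) < min(budget, len(collection)):
        labeled = synthetic + [(collection[d], r) for d, r in reviewed.items()]
        model = train(labeled)                                           # step 2
        remaining = [d for d in collection if d not in reviewed]
        remaining.sort(key=lambda d: (-score(model, collection[d]), d))  # step 3
        for doc_id in remaining[:batch_size]:                            # step 4
            reviewed[doc_id] = assessor(doc_id)                          # steps 5 and 6
    return reviewed

# Toy collection and relevance judgments (made up for illustration).
collection = {
    "d1": "vitamin a supplements children pneumonia trial",
    "d2": "football match report",
    "d3": "retinol therapy pneumonia children",
    "d4": "stock market news",
    "d5": "vitamin a measles study",
}
truth = {"d1", "d3"}
reviewed = review_loop(collection, "vitamin a pneumonia", lambda d: d in truth, budget=3)
print(reviewed)
```

Each pass retrains on everything reviewed so far, so the ranking of the remaining documents changes as feedback arrives.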

Page 22

Exploitation: Baseline Model

Page 23

Exploitation: Baseline Model

Page 24

Exploitation: Baseline Model

Page 25

Exploitation: Baseline Model

Uncover the most relevant document to present

Page 26

Exploitation: Active Learning

Page 27

Exploitation: Active Learning

Page 28

Exploitation: Active Learning

Page 29

Exploitation: Active Learning

Uncover the most informative document to present
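The difference between the baseline and active learning is the selection rule. A minimal sketch, assuming the classifier outputs a relevance probability per document and reading "most informative" as uncertainty sampling (the score closest to the decision boundary); the slides do not name the exact criterion, and the scores below are made up.

```python
def most_relevant(scores):
    """Baseline rule: pick the document with the highest predicted relevance."""
    return max(scores, key=lambda d: (scores[d], d))

def most_informative(scores, boundary=0.5):
    """Uncertainty sampling: pick the document whose score is closest to the boundary."""
    return min(scores, key=lambda d: (abs(scores[d] - boundary), d))

# Hypothetical predicted relevance probabilities for unreviewed documents.
scores = {"d1": 0.95, "d2": 0.52, "d3": 0.10}
print(most_relevant(scores))
print(most_informative(scores))
```

The first rule shows the user a likely hit; the second asks about the document the classifier is least sure of, which tends to improve the model faster.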

Page 30

Exploration: Hierarchical Clustering

Page 31

Exploration: Hierarchical Clustering

Page 32

Exploration: Hierarchical Clustering

Page 33

Reinforcement Learning

Balances:

1. Exploitation: uncover the most relevant document

2. Exploitation: uncover the most informative document

3. Exploration: uncover documents from different regions
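One simple way to balance the three options is to treat each selection strategy as an arm of a bandit and reward an arm whenever the document it uncovers turns out to be relevant. The sketch below uses epsilon-greedy as an assumed stand-in; the slides do not say which reinforcement learning method was used, and the class and strategy names are hypothetical.

```python
import random

class StrategyBandit:
    """Epsilon-greedy choice among document-selection strategies (assumed stand-in)."""

    def __init__(self, strategies, epsilon=0.1, seed=0):
        self.strategies = list(strategies)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {s: 0 for s in self.strategies}
        self.reward = {s: 0.0 for s in self.strategies}

    def mean(self, strategy):
        """Observed fraction of relevant documents uncovered by this strategy."""
        pulls = self.pulls[strategy]
        return self.reward[strategy] / pulls if pulls else 0.0

    def choose(self):
        # Try every strategy once before comparing them.
        untried = [s for s in self.strategies if self.pulls[s] == 0]
        if untried:
            return untried[0]
        # With probability epsilon pick at random, otherwise the best so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.strategies)
        return max(self.strategies, key=self.mean)

    def update(self, strategy, was_relevant):
        self.pulls[strategy] += 1
        self.reward[strategy] += 1.0 if was_relevant else 0.0

bandit = StrategyBandit(["most_relevant", "most_informative", "new_region"], epsilon=0.1)
strategy = bandit.choose()
bandit.update(strategy, was_relevant=True)  # feedback from the assessor
```

In a review loop, `choose` would decide which rule picks the next document and `update` would record the assessor's judgment for that rule.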

Page 34

User Feedback

• Leads to a static, one‐size‐fits‐all search engine


Page 35

Session Personalization

Page 36

TREC Session

Objective:

• Improve retrieval performance for a given query by using the session prior to this query


Page 37

TREC Session: Test Collection

Page 38

Query change

• Changes in the query

– Adding a term

– Removing a term

– Keeping a term

• Correlations between Δ(query) and feedback

• Task stage

– Sub‐tasks

• User stage

– Struggling

– Exploring

– Exploiting

[Diagram: example task “Travel to Beijing” with sub‐tasks Flight tickets, Hotel Room, Map, and Conference POI; annotated with Task/Subtasks, Exploit/Explore/Struggle, and Query Changes]
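The query change between consecutive queries in a session can be computed as three term sets: added, removed, and kept. A minimal sketch, assuming whitespace tokenization; the session queries are invented around the "Travel to Beijing" example and `query_change` is a hypothetical helper.

```python
def query_change(previous, current):
    """Split the change between two consecutive queries into added, removed, and kept terms."""
    prev = set(previous.lower().split())
    curr = set(current.lower().split())
    return {
        "added": sorted(curr - prev),
        "removed": sorted(prev - curr),
        "kept": sorted(prev & curr),
    }

# Hypothetical session for the travel task.
session = ["travel to beijing", "beijing flight tickets", "beijing hotel room"]
for before, after in zip(session, session[1:]):
    print(before, "->", after, query_change(before, after))
```

Kept terms hint at the stable task ("beijing"), while added and removed terms hint at a switch between sub‐tasks; correlating these sets with the feedback on earlier results is what the slide's Δ(query) refers to.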

Page 39

Dialog Systems

• Fully conversational system

• Search algorithm asking questions to the user

Page 40

Conclusions

• Professional Search

– Exploratory

– Complex

– Recall‐oriented

• Fully conversational systems

– Receive feedback on documents and query rewrites

– Explicitly ask for feedback