Dynamic Reference Sifting Jonathan Shakes, Marc Langheinrich, and Oren Etzioni University of Washington Department of Computer Science and Engineering A Case Study in the Homepage Domain

Post on 22-Dec-2015


Page 1: Dynamic Reference Sifting

Jonathan Shakes, Marc Langheinrich, and Oren Etzioni

University of Washington

Department of Computer Science and Engineering

A Case Study in the Homepage Domain

Page 2: Outline

Introduction
  Softbots and Dynamic Reference Sifters
  Searching the Web

Case Study: Personal Homepages
  Ahoy! The Homepage Finder
  Experimental Results

Future and Related Work
  Other Domains for DRS

Page 3: Softbots and Dynamic Reference Sifters

Dynamic Reference Sifters
  Part of “Internet Softbots Project” [Etzioni and Weld, 1994]

Softbots
  Person states what
  Softbot determines how and where

Page 4: Information Retrieval Definitions

Precision
  Measure of Search Service Accuracy

Recall
  Measure of Search Service Comprehensiveness

Page 5: Precision

$$\text{Precision} = \frac{\text{Relevant Search Results}}{\text{All Search Results}}$$

[Diagram labels: Search Space, Relevant Documents, Irrelevant Documents, All Search Results]

Page 6: Recall

$$\text{Recall} = \frac{\text{Relevant Search Results}}{\text{All Relevant Documents}}$$

[Diagram labels: Search Space, Relevant Documents, Irrelevant Documents, All Search Results]
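The precision and recall ratios can be made concrete with a small calculation. The following is a minimal sketch, not part of the original slides; the document identifiers and the example sets are hypothetical.

```python
def precision(results, relevant):
    """Fraction of the returned search results that are relevant."""
    results, relevant = set(results), set(relevant)
    if not results:
        return 0.0
    return len(results & relevant) / len(results)


def recall(results, relevant):
    """Fraction of all relevant documents that appear in the search results."""
    results, relevant = set(results), set(relevant)
    if not relevant:
        return 0.0
    return len(results & relevant) / len(relevant)


# Hypothetical data: 4 results are returned, 2 of which are relevant;
# 5 relevant documents exist in the whole search space.
results = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d5", "d6", "d7"]

p = precision(results, relevant)  # 2 relevant results / 4 results
r = recall(results, relevant)     # 2 relevant results / 5 relevant documents
print(f"precision={p:.2f} recall={r:.2f}")
```

A service that returns many references tends to raise recall at the cost of precision, which is the trade-off the following slides discuss.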

Page 7: Searching the Web

Web Indices (AltaVista, Hotbot)
  Automated - high recall
  Keyword based - low precision

Web Directories (Yahoo, A2Z)
  Classified manually - high precision, low recall

Manual Search
  Slow

Page 8: Searching the Web

Dynamic Reference Sifter
An information retrieval tool that uses:
  multiple, complementary data sources for high recall,
  domain-specific filtering techniques for high precision, and
  machine learning to improve performance over time.
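The first two ingredients of a dynamic reference sifter, combining several sources and then filtering, can be sketched as follows. This is an illustrative sketch only: the slides do not describe an implementation, and the function names, the stand-in sources, and the filter rule are all hypothetical. The learning component is left out.

```python
from typing import Callable, Iterable, List

Source = Callable[[str], Iterable[str]]  # query -> candidate references
Filter = Callable[[str, str], bool]      # (query, reference) -> keep it?


def sift(query: str, sources: List[Source], filters: List[Filter]) -> List[str]:
    """Pool candidates from every source, then keep those passing every filter."""
    seen = set()
    candidates = []
    # Pooling several complementary sources is what aims at high recall.
    for source in sources:
        for ref in source(query):
            if ref not in seen:
                seen.add(ref)
                candidates.append(ref)
    # Domain-specific filters are what aim at high precision.
    return [ref for ref in candidates if all(f(query, ref) for f in filters)]


# Hypothetical stand-ins for a web index and a directory.
def index_source(query):
    return ["http://example.edu/~jdoe/", "http://example.com/news/doe.html"]


def directory_source(query):
    return ["http://example.edu/~jdoe/", "http://example.org/people/jdoe/"]


# Hypothetical filter: treat only directory-style URLs as homepage candidates.
def looks_like_homepage(query, ref):
    return ref.endswith("/")


found = sift("jdoe", [index_source, directory_source], [looks_like_homepage])
print(found)
```

Duplicates reported by more than one source are collapsed before filtering, so each reference is judged once.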

Page 9: Case Study: The Personal Homepage Domain

“Conventional” Search Services
  Indices find too much
  Directories find too little
  Manual Search takes too long
  Failures are expensive

Ahoy! The Homepage Finder attempts to provide
  High Recall
  High Precision
  Speed

Page 10: Ahoy! Architecture

[Diagram components: User Input, Web Page Reference Source, Institutional Information Source, E-mail Address Sources, Filters, Output]

Page 11: Performance Analysis

Test using lists of known homepages
  Researchers sample: 582 homepages
  Transportation sample: 53 homepages

Compare against MetaCrawler, Hotbot, AltaVista, Yahoo!

Maximize competitors’ performance by
  using “expert” options
  allowing up to 200 references

Page 12: Performance Analysis

“Precision” - Researcher Sample

[Chart. X-axis: Search Service (Ahoy!, MetaCrawler, Hotbot, AltaVista, Yahoo!). Y-axis: Targets Found (0% to 90%). Series: Highest-ranked Reference.]

Page 13: Performance Analysis

Top 10 References - Researcher Sample

[Chart. X-axis: Search Service (Ahoy!, MetaCrawler, Hotbot, AltaVista, Yahoo!). Y-axis: Targets Found (0% to 90%). Series: Highest 10 References, Highest-ranked Reference.]

Page 14: Performance Analysis

Recall (all References) - Researcher Sample

[Chart. X-axis: Search Service (Ahoy!, MetaCrawler, Hotbot, AltaVista, Yahoo!). Y-axis: Targets Found (0% to 90%). Series: All References, Highest 10 References, Highest-ranked Reference.]

Page 15: Performance Analysis

Recall (all References) - Transportation Sample

[Chart. X-axis: Search Service (Ahoy!, MetaCrawler, Hotbot, AltaVista, Yahoo!). Y-axis: Targets Found (0% to 90%). Series: All References, Highest 10 References, Highest-ranked Reference.]
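The three series in these charts (highest-ranked reference, highest 10 references, all references) can each be read as a "targets found" rate at a different rank cutoff. The sketch below shows one way to compute such a rate from a list of known homepages; it is an assumption about the measurement, not the authors' evaluation code, and the queries and references are made up.

```python
def targets_found(ranked_results, targets, cutoff=None):
    """Fraction of queries whose known target appears within the first
    `cutoff` returned references (all references when cutoff is None)."""
    if not targets:
        return 0.0
    hits = 0
    for query, target in targets.items():
        refs = ranked_results.get(query, [])
        if cutoff is not None:
            refs = refs[:cutoff]
        if target in refs:
            hits += 1
    return hits / len(targets)


# Hypothetical test set: one known target per query.
targets = {"q1": "A", "q2": "B", "q3": "C", "q4": "D"}

# Hypothetical ranked output of one search service.
ranked = {
    "q1": ["A", "x"],            # target ranked first
    "q2": ["x", "B"],            # target ranked second
    "q3": ["x"] * 10 + ["C"],    # target ranked eleventh
    "q4": ["x", "y"],            # target never returned
}

top1 = targets_found(ranked, targets, cutoff=1)
top10 = targets_found(ranked, targets, cutoff=10)
overall = targets_found(ranked, targets)
print(top1, top10, overall)
```

The rate can only grow as the cutoff widens, so within one service the "All References" series is never below the "Highest 10" series, which is never below the "Highest-ranked" series.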

Page 16: Learning in Ahoy!

Learns URL ‘patterns’
  http://sdcc3.ucsd.edu/home-pages/<Login>/
  50,000+ patterns in 3 months

Indexes patterns by institution
  11,000+ institutions indexed in 3 months

Performance Impact
  Up to 8% gain in recall
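One plausible reading of URL pattern learning is: generalize a confirmed homepage URL by replacing the owner's login with a placeholder, store the pattern under the institution, and later substitute another login to guess a URL. The sketch below follows that reading. The slides do not specify how Ahoy! learns or applies patterns, so the class, the function names, the institution key, and the logins `jdoe` and `asmith` are all hypothetical; only the example pattern comes from the slide.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

PLACEHOLDER = "<Login>"


def learn_pattern(url, login):
    """Replace the login inside the URL path with a placeholder.
    Returns None when the login does not occur in the path."""
    parts = urlsplit(url)
    if not login or login not in parts.path:
        return None
    path = parts.path.replace(login, PLACEHOLDER)
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


class PatternIndex:
    """Learned URL patterns, indexed by institution (hypothetical structure)."""

    def __init__(self):
        self._by_institution = defaultdict(set)

    def add(self, institution, url, login):
        pattern = learn_pattern(url, login)
        if pattern is not None:
            self._by_institution[institution].add(pattern)

    def guess_urls(self, institution, login):
        """Candidate homepage URLs for a login at a known institution."""
        patterns = self._by_institution.get(institution, ())
        return sorted(p.replace(PLACEHOLDER, login) for p in patterns)


index = PatternIndex()
# A confirmed homepage for the hypothetical login "jdoe".
index.add("ucsd.edu", "http://sdcc3.ucsd.edu/home-pages/jdoe/", "jdoe")

pattern = learn_pattern("http://sdcc3.ucsd.edu/home-pages/jdoe/", "jdoe")
guesses = index.guess_urls("ucsd.edu", "asmith")
print(pattern)
print(guesses)
```

Guessed URLs of this kind could reach homepages that no index has crawled, which is consistent with the reported gain in recall, though the slides do not state the mechanism.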

Page 17: Domain Characteristics

Many elements

Easily identifiable target

Some targets found in web indices

User can form specific query

Page 18: Domain Examples

Personal Homepages
Articles or Papers
Product Reviews
Price Lists
Transportation Schedules
Recipes
Jokes
and more

Page 19: (un)Related Work

Automated Index Generation
  WebCrawler, Lycos, AltaVista, ...

Automated Directory Generation
  IAF, OKRA, WhoWhere?

Dynamic Internet Search
  Netfind

Learning User Preferences on the web
  WebWatcher, Syskill & Webert, Firefly

Learning about the web
  ShopBot, auto-generated wrappers

Page 20: Summary and Conclusions

Dynamic Reference Sifting
  domain-specific, high precision, high recall, fast

Ahoy! the Homepage Finder
  2000 searches per day
  1-2 references returned per search
  50-75% targets found
    25% not found, often correctly so
  10-15 seconds per search

Future domains
  Academic Papers, Jokes

Page 21: Ahoy! the Homepage Finder

http://www.cs.washington.edu/research/ahoy/