Characterizing Search Behavior in Web Archives Miguel Costa, Mário J. Silva LaSIGE @ Faculty of Sciences, University of Lisbon Foundation for National Scientific Computing TWAW2011, Hyderabad, India

Characterizing Search Behavior in Web Archives

Miguel Costa, Mário J. Silva LaSIGE @ Faculty of Sciences, University of Lisbon

Foundation for National Scientific Computing

TWAW2011, Hyderabad, India

Ephemeral Web

• The web contains unique and valuable information

– news, interviews, opinions, feelings

• 80% of web documents are unavailable after 1 year.

• Knowledge gap for future generations

Web Archiving Initiatives

• 42 web archiving initiatives in 26 countries.

• More than 180 billion documents archived since 1996.

Web Archiving Workflow

[Diagram: workflow stages Acquisition, Storage, Searching, Preservation, Presentation]

• Search technology based on web search engines

– ignores the temporal dimension

– does not understand the end users

1st: Understanding Users

• Why do users search? (information needs)

• What do users search for? (topics)

• How do users search? (search behavior)

– this study: 1st characterization

Predicting users’ behavior can improve

• Response time – e.g. cache, special indexes

• Quality of results – e.g. better ranking, suggest queries

• Web design – e.g. make most used functionalities stand out

Portuguese Web Archive

• Archives the Portuguese Web ≈ .PT domain

• ≈ 182M documents:

• searchable by full-text and URL.

• ranging from 1996 to 2009.

• Search available since 2010.

http://archive.pt

Interface: full-text search

[Screenshot: Results Page]

Interface: URL search

[Screenshot: Versions Page; annotation: same text box]

Methodology

Search Log Analysis

• Pros

• Large and varied

• Less bias

• Cheaper

• Non-intrusive

• Real information needs

• Cons

• Lack of context

• Lack of control

Dataset of Search Logs

• ≈ 10K sessions over 7 months of 2010

• Procedure

• cleansing

• normalized and excluded invalid sessions & queries

• session delimitation

• used IP address, user session and a 30-minute gap

• Users

• 72% of IP addresses → Portugal

• 89% of interactions → PT language interface
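
The session delimitation step in the procedure above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout `(ip, session_id, timestamp, query)` and the function name `delimit_sessions` are assumptions made for the example. Only the three criteria (IP, user session, 30-minute gap) come from the slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # the 30-minute gap named on the slide


def delimit_sessions(entries, gap=SESSION_GAP):
    """Group log entries into sessions.

    `entries` is an iterable of (ip, session_id, timestamp, query) tuples.
    This record layout is hypothetical; the slides do not describe the log format.
    """
    # Entries that share IP and user session belong to the same user stream.
    by_user = defaultdict(list)
    for ip, session_id, timestamp, query in entries:
        by_user[(ip, session_id)].append((timestamp, query))

    sessions = []
    for key, events in by_user.items():
        events.sort(key=lambda event: event[0])
        current = [events[0]]
        for prev, curr in zip(events, events[1:]):
            # A pause longer than the gap closes the current session.
            if curr[0] - prev[0] > gap:
                sessions.append((key, current))
                current = []
            current.append(curr)
        sessions.append((key, current))
    return sessions
```

Each returned item pairs the `(ip, session_id)` key with the time-ordered events of one session, so per-session statistics such as queries per session can be computed directly from the result.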

How do users search?

General Statistics

• Full-text sessions + URL sessions ≈ 90%

• Full-text sessions / URL sessions ≈ 2:1

• A typical full-text session:

• 1 or 2 queries

• 1 to 3 terms per query

• 1 or 2 result pages seen per query

• 1 click per query

• A typical URL session:

• 1 or 2 queries

• 1 or 2 clicks per query

Query Distribution

[Bar chart: # full-text queries per session. x-axis: # queries (1 to ≥10); y-axis: % sessions (0% to 70%). Annotation: 85%]

Query Refinement

[Bar chart: # full-text terms changed. x-axis: # terms changed (≤-5 to ≥5); y-axis: % queries (0% to 35%). Annotation: 71%]

• terms added in 42% of queries

• terms removed in 25% of queries
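
One way to measure query refinement of this kind is sketched below. It assumes that "# terms changed" means the signed difference in term count between a query and the previous query in the same session, and that terms are split on whitespace. Both are assumptions, since the slides do not define either. The function names are hypothetical.

```python
def term_count_changes(session_queries):
    """Signed change in term count between consecutive queries of one session.

    Positive means terms were added, negative means terms were removed,
    0 means the query length did not change. Whitespace tokenization is
    an assumption of this sketch.
    """
    lengths = [len(query.split()) for query in session_queries]
    return [curr - prev for prev, curr in zip(lengths, lengths[1:])]


def refinement_shares(sessions):
    """Share of reformulated queries that added, removed, or kept term count.

    `sessions` is a list of sessions, each a list of query strings in order.
    """
    changes = [c for queries in sessions for c in term_count_changes(queries)]
    total = len(changes)
    if total == 0:
        return {"added": 0.0, "removed": 0.0, "unchanged": 0.0}
    return {
        "added": sum(c > 0 for c in changes) / total,
        "removed": sum(c < 0 for c in changes) / total,
        "unchanged": sum(c == 0 for c in changes) / total,
    }
```

Note that a net count hides substitutions: replacing one term with another yields a change of 0 under this definition.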

Exploring Popularity

[Figure: popularity curve with axis labels Many / Few and Popular / Rare]

• Queries, terms, clicks and archived pages seen follow a power law distribution

• 27% top queries → 50% query volume

• 6% top terms → 50% query volume

• 10% top pages seen → 26% all pages seen

• 66% clicks → 1st result page
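
Concentration figures of the form "X% top queries → 50% query volume" can be computed along the lines of the sketch below. It is an illustration under assumptions, not the authors' procedure: the function name `top_share_for_volume` is hypothetical, and items are counted as exact string matches.

```python
from collections import Counter


def top_share_for_volume(items, target=0.5):
    """Fraction of distinct items, most frequent first, needed to cover
    `target` of the total volume.

    `items` is a flat list of observations (for example, one entry per
    submitted query). Exact-match counting is an assumption of this sketch.
    """
    counts = sorted(Counter(items).values(), reverse=True)
    if not counts:
        return 0.0
    total = sum(counts)
    covered = 0
    for rank, count in enumerate(counts, start=1):
        covered += count
        if covered >= target * total:
            return rank / len(counts)
    return 1.0
```

The smaller the returned fraction, the more skewed the distribution. In a heavily skewed log, a small share of distinct queries accounts for half of all submissions, which is what makes caching popular queries worthwhile.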

How do users search?

• Spend little time and effort on individual searches

• Search and explore following power law distributions

• Search in web archives as in web search engines

• Excite (U.S.), Fast (Europe), Tumba! (Portugal)

• Slightly fewer queries, but slightly longer ones

But, what about time?

1/3 of Queries are Restricted by Date

[Bar chart: % queries restricted by date. x-axis: restriction (start date, end date, start & end date); y-axis: % queries (0% to 35%); series: full-text, URL]

Oldest Versions are more Searched

[Chart: % queries restricted by date, per year. x-axis: years (1996 to 2009); y-axis: % queries restricted by date (30 to 100); series: full-text queries, URL queries]

Oldest Versions are more Clicked

[Chart: # clicks / # times displayed. x-axis: 1 to 9; y-axis: # clicks / # times displayed (0 to 60)]

Conclusions

Conclusions

• Web archive users:

– search as in web search engines

– prefer full-text search over URL search

– prefer the oldest documents over the newest

Future Work

• Validate results

– larger datasets, other sources, throughout time

• Use results to improve:

– ranking considering time

– throughput and response speed

– user interface

Questions

Thank you.

http://archive.pt