
TEMPORAL SPREAD IN ARCHIVED COMPOSITE RESOURCES (WORK IN PROGRESS)

SCOTT G. AINSWORTH

MICHAEL L. NELSON

OLD DOMINION UNIVERSITY

COMPUTER SCIENCE

WADL 2013

JULY 25–26, 2013

INDIANAPOLIS, INDIANA USA

Joint Conference on Digital Libraries (JCDL) 2013

Scott G. Ainsworth • Michael L. Nelson


CONTENTS

Motivation

Related work

Preliminary work

Temporal Spread

Future work

Conclusion

7/26/13


A FABLE FROM WAYBACK


TEMPORAL SPREAD

[Figure: archived ODU CS home page. Root memento: 2005-05-14 01:36:08. Embedded resources annotated +9 days, +18 days, +18 days, +7 months, and +2.1 years.]


QUESTIONS

• How much temporal spread exists in composite mementos?

• How can temporal spread be minimized?

• What factors contribute, positively or negatively, to spread?

• Does combining multiple archives produce better results?

• Would users with differing goals benefit from different minimization policies and heuristics?

• How can temporal coherence be displayed to users—simply?



RELATED WORK

Control Crawl Data Quality, Future Collections

• Spaniol et al. – crawling strategy

• Denev et al. – change rates by MIME type and depth

• Ben Saad et al. – metadata from crawl used to select best results from archive

Our Focus: Existing Data Quality

• Existing collections

• Datetime selection policies


RELATED WORK

Use Patterns

• AlNoamany et al. – Archive Access Patterns
  • Humans vs. robots
  • Dip, dive, slide, & skim

Identifying Duplicates

• Simple identity – images, other binary formats
  • Direct comparison
  • Hash comparison

• HTML, CSS (text)
  • Shingling, Jaccard distances, etc.
  • SimHash (most promise)
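As a rough illustration of the text-comparison idea on this slide, the sketch below builds word shingles for two strings and computes their Jaccard similarity. The function names and the shingle width of 3 are assumptions chosen for the example; the talk does not specify an implementation.

```python
def shingles(text: str, width: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word shingles (runs of `width` consecutive words)."""
    words = text.split()
    if len(words) < width:
        # Short texts yield a single shingle holding whatever words exist.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + width]) for i in range(len(words) - width + 1)}


def jaccard_similarity(a: str, b: str, width: int = 3) -> float:
    """Jaccard similarity of the two shingle sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = shingles(a, width), shingles(b, width)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)
```

A score of 1.0 means the shingle sets are identical and 0.0 means they share nothing; Jaccard distance is simply one minus this value.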


RELATED WORK – MEMENTO*

• HTTP extension for datetime negotiation

Request

GET <timegate>/http://www.cs.odu.edu/ HTTP/1.1
…
Accept-Datetime: Tue, 10 May 2005 11:21:00 GMT
…

Response

HTTP/1.1 200 OK
…
Memento-Datetime: Sat, 14 May 2005 01:36:08 GMT
…

*https://datatracker.ietf.org/doc/draft-vandesompel-memento/
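The two headers in this exchange are ordinary HTTP dates, so they can be produced and parsed with Python's standard library. The sketch below is only an illustration: the helper names are hypothetical, no network request is made, and the timegate URI is left out because the slide does not give one.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(target: datetime) -> str:
    """Format a timezone-aware datetime as an HTTP date for Accept-Datetime."""
    return format_datetime(target.astimezone(timezone.utc), usegmt=True)


def memento_offset(target: datetime, memento_datetime_header: str):
    """Return (Memento-Datetime minus target) as a timedelta."""
    memento = parsedate_to_datetime(memento_datetime_header)
    return memento - target


# Values taken from the request/response pair on the slide.
target = datetime(2005, 5, 10, 11, 21, 0, tzinfo=timezone.utc)
header = accept_datetime_header(target)  # "Tue, 10 May 2005 11:21:00 GMT"
offset = memento_offset(target, "Sat, 14 May 2005 01:36:08 GMT")
print(header, offset)
```

Here the archive answered with a memento captured a little over three and a half days after the requested datetime.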


CONTENTS

Motivation

Related work

Preliminary work

• How much of the Web is archived

• Temporal Drift

Temporal Spread

Future work

Conclusion


HOW MUCH IS ARCHIVED?

Share of URIs   Archived copies
35 – 90%        At least one archived copy
17 – 49%        2 – 5 copies
1 – 8%          6 – 10 copies
8 – 63%         > 10 copies

Source: JCDL’11

[Chart legend: Internet Archive, Search Engine, Other]



TEMPORAL DRIFT

Comparing two policies

• Sliding – target datetime changes

• Sticky – target datetime held steady


SLIDING TARGET

CS Home: 2005-05-14 01:36:08

Science Home: 2005-04-22 00:17:52

CS Home: 2005-03-31 09:16:10

TEMPORAL DRIFT

WHAT WE EXPECTED: 2005-05-14 @ 01:36:08

WHAT WE GOT: 2005-03-31 @ 09:16:10


STICKY TARGET

What if the target is held steady?

(Enabled by Memento API)


[Screenshot: MementoFox Extension, target datetime 2005-05-14]

CS Home: 2005-05-14 01:36:08

Science Home: 2005-04-22 00:17:52

CS Home: 2005-05-14 01:36:08

DRIFT COMPARISON

Page           Sliding datetime      Sliding drift            Sticky datetime       Sticky drift
CS Home        2005-05-14 01:36:08   –                        2005-05-14 01:36:08   –
Science Home   2005-04-22 00:17:52   22.1 days                2005-04-22 00:17:52   22.1 days
CS Home        2005-03-31 09:16:10   43.7 days (+21.6 days)   2005-05-14 01:36:08
Mean                                 32.9 days                                      11.0 days
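The drift figures in this comparison can be checked with a few lines of arithmetic. The sketch below assumes that drift is the absolute distance between the original target datetime (2005-05-14 01:36:08) and each Memento-Datetime, and that the mean is taken over the two steps after the starting page; both assumptions are inferred from the numbers on the slide rather than stated in it.

```python
from datetime import datetime
from statistics import mean

FMT = "%Y-%m-%d %H:%M:%S"
SECONDS_PER_DAY = 86_400


def drift_days(target: str, memento: str) -> float:
    """Absolute distance, in days, between the target and a Memento-Datetime."""
    delta = datetime.strptime(target, FMT) - datetime.strptime(memento, FMT)
    return abs(delta.total_seconds()) / SECONDS_PER_DAY


target = "2005-05-14 01:36:08"
# Memento-Datetimes reached after the first step of the walk.
sliding = ["2005-04-22 00:17:52", "2005-03-31 09:16:10"]
sticky = ["2005-04-22 00:17:52", "2005-05-14 01:36:08"]

sliding_drift = [drift_days(target, m) for m in sliding]
sticky_drift = [drift_days(target, m) for m in sticky]

print([round(d, 1) for d in sliding_drift], round(mean(sliding_drift), 1))
# [22.1, 43.7] 32.9
print([round(d, 1) for d in sticky_drift], round(mean(sticky_drift), 1))
# [22.1, 0.0] 11.0
```

Under the sliding policy each step inherits the previous page's datetime as its new target, which is why the error compounds; holding the target steady lets the walk return to the original capture.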


MEDIAN DRIFT BY STEP

[Chart: median drift (months) by step number, Sliding vs. Sticky. Source: JCDL’13]


TEMPORAL SPREAD


COMPOSITE MEMENTO

PRESENTATION STRUCTURE


EMBEDDED RESOURCES

Resource                         Memento-Datetime      Delta
http://www.cs.odu.edu            2005-05-14 01:36:08
mm_menu.js                       2005-05-23 02:39:12   9.0 d
style.css                        2005-05-23 02:39:39   9.0 d
gfx-logo-odu-crown.gif           2005-05-23 02:39:39   9.0 d
ddmenu_ddown.js                  2005-05-23 02:39:43   9.0 d
university.js                    2005-05-23 02:39:56   9.0 d
rmenu_1st_about.png              2005-06-01 13:40:25   18.5 d
rmenu_bottom_229.gif             2005-06-01 14:07:29   18.5 d
shadow-bl.gif                    2005-06-01 14:55:53   18.6 d
ecsbdg.jpg                       2005-06-01 14:56:17   18.6 d
shadow-br.gif                    2005-06-01 15:18:18   18.6 d
gfx-btn-go-dblue.gif             2005-06-01 15:34:19   18.6 d
shadow-tr.gif                    2005-06-01 15:55:57   18.6 d
header-right1.gif                2005-06-01 16:06:16   18.6 d
spacer.gif                       2005-06-01 16:23:10   18.6 d
jimcheng.gif                     2005-06-01 16:37:39   18.6 d
jsmith.gif                       2005-06-01 16:58:50   18.6 d
rmenu_1st_featured_alumni.png    2005-06-01 21:21:45   18.8 d
hmenu_college_...-new.png        2005-12-21 20:14:25   7.3 mo
rmenu_1st_upcoming_news.png      2005-12-21 20:15:14   7.3 mo
rmenu_1st_upcoming_events.png    2005-12-21 21:01:12   7.3 mo
lmenu_1st_resources.png          2005-12-28 17:47:41   7.5 mo
bullet_blue_triangle.gif         2005-12-28 19:43:48   7.5 mo
logo-cs.gif                      2005-12-28 19:54:29   7.5 mo
rmenu_1st_featured_student.png   2007-06-12 02:36:07   2.1 years
shadow-b.gif                     2007-06-21 02:35:17   2.1 years
shadow-r.gif                     404 Not Found

Embedded Resources    26
Mean Delta            125.9 days
Standard Deviation    207.7 days
Spread                2.1 years
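To make the summary statistics concrete, the sketch below recomputes deltas and spread for three of the rows in the table. It assumes that delta is the absolute distance from the root's Memento-Datetime and that spread is the span from the earliest to the latest Memento-Datetime in the composite, root included; the slide does not define either term, and it does not say whether the standard deviation is the sample or population form.

```python
from datetime import datetime
from statistics import mean, stdev

FMT = "%Y-%m-%d %H:%M:%S"
SECONDS_PER_DAY = 86_400

root = datetime.strptime("2005-05-14 01:36:08", FMT)
# Three embedded resources from the table, chosen for brevity.
embedded = {
    "mm_menu.js": "2005-05-23 02:39:12",
    "spacer.gif": "2005-06-01 16:23:10",
    "shadow-b.gif": "2007-06-21 02:35:17",
}

# Delta: absolute distance in days between the root and each embedded memento.
deltas = {
    name: abs((datetime.strptime(ts, FMT) - root).total_seconds()) / SECONDS_PER_DAY
    for name, ts in embedded.items()
}

# Spread (assumed definition): latest minus earliest Memento-Datetime.
all_times = [root] + [datetime.strptime(ts, FMT) for ts in embedded.values()]
spread_days = (max(all_times) - min(all_times)).total_seconds() / SECONDS_PER_DAY

values = list(deltas.values())
mean_delta = mean(values)
stdev_delta = stdev(values)  # sample standard deviation (an assumption)

print({name: round(d, 1) for name, d in deltas.items()})
print(round(spread_days / 365.25, 1), "years")
```

Feeding in all 25 resolvable rows instead of three would give the slide-level mean and standard deviation under the same assumptions.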


REPRESENTING SPREAD

COMPOSITE MEMENTO

TEMPORAL SPREAD CHART

Legend: Root, Embedded, Diff. Domain, Reused


TEMPORAL SPREAD – ODU CS


FIRST EXPERIMENT

• 1,000 URIs from DMOZ (Open Directory)

• Download all timemaps

• Download all composite mementos

• Download all embedded resources

• Single and Multiple Archives

• Four Heuristics


PRELIMINARY RESULTS

Count       Description                        Percent
1,000       Root URI-Rs
910         Root timemaps                      91%
87,847      Root URI-Ms in timemaps
96.5        URI-Ms per Root URI-R
85,570      Root mementos downloaded           97%
1,488,420   Embedded URI-Rs
17.4        Embedded URI-Rs per Root memento
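The derived rows in this table follow from the raw counts. The short check below shows which denominators reproduce them; the variable names are illustrative only.

```python
root_uri_rs = 1_000
root_timemaps = 910
root_uri_ms = 87_847
root_mementos_downloaded = 85_570
embedded_uri_rs = 1_488_420

timemap_pct = 100 * root_timemaps / root_uri_rs                 # 91%
uri_ms_per_root = root_uri_ms / root_timemaps                   # about 96.5
downloaded_pct = 100 * root_mementos_downloaded / root_uri_ms   # about 97%
embedded_per_root = embedded_uri_rs / root_mementos_downloaded  # about 17.4

print(timemap_pct, round(uri_ms_per_root, 1),
      round(downloaded_pct), round(embedded_per_root, 1))
```

Note that the 96.5 figure matches division by the 910 URI-Rs that have a timemap, not by all 1,000 sampled URI-Rs.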


SINGLE/MULTI & HEURISTICS

Description                   Minimize Distance,   Minimize Distance,   3-Month Window,
                              Single Archive       Multi-Archive        Multi-Archive
Embedded URI-Rs               1,488,440            1,488,420            1,447,351
Embedded URI-Ms in timemaps   1,169,787            1,186,456            500,541
URI-M/Embedded URI-R          0.79                 0.80                 0.35
% Complete                    73.8%                75.4%                33.8%
Mean spread                   200.2                200.1                15.1
Standard Deviation            219.2                219.9                14.3
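The trade-off in this table (a window cuts spread sharply but leaves more resources missing) can be seen in a small selection routine. The sketch below is one possible reading of the heuristics, not the authors' code: `closest_memento` is a hypothetical helper, and treating the 3-month window as 90 days on either side of the root is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def closest_memento(root: datetime, candidates: Sequence[datetime],
                    window: Optional[timedelta] = None) -> Optional[datetime]:
    """Pick the candidate nearest to the root's Memento-Datetime.

    With `window` set, a candidate farther away than the window is rejected
    and the embedded resource is left missing (None).
    """
    if not candidates:
        return None
    best = min(candidates, key=lambda m: abs(m - root))
    if window is not None and abs(best - root) > window:
        return None
    return best


root = datetime(2005, 5, 14, 1, 36, 8)
# Captures of one embedded resource; for the multi-archive case this list
# would simply hold captures merged from several archives.
candidates = [datetime(2005, 12, 21, 20, 14, 25), datetime(2007, 6, 12, 2, 36, 7)]

unbounded = closest_memento(root, candidates)
windowed = closest_memento(root, candidates, window=timedelta(days=90))
print(unbounded, windowed)
```

Without a window the resource is filled with a capture about seven months off; with the window it is dropped, which lowers completeness but keeps the composite's spread small.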


TEMPORAL COHERENCE

• 1 Memento, Bracketed Root

• 1 Memento, Root Not Bracketed

• 1 Memento, No Last-Modified

• 1 Memento, Before Root

TEMPORAL COHERENCE

• 2 Mementos, Root Not Bracketed

• 2 Mementos, Use Content – Similarity

• 2 Mementos, Contents Equal or Equivalent

• 2 Mementos, Contents Not Equal or Equivalent

CURRENT EXPERIMENT

• 4,000 URIs from JCDL’11 “How Much…” paper

• 1 URI/month instead of all

• Temporal coherence patterns

• Target WSDM 2013


FUTURE WORK

Timemaps, Redirection, Missing Mementos

• Timemaps only tell part of the story

• URI-R redirection (302 from source)

• URI-M redirection (Archive action)

• Mementos in timemaps but not accessible

• Policies must consider user needs
  • Leave it missing
  • Show “best” substitute


FUTURE WORK

Similarity & Duplication

• Deltas are currently | root – embedded |

• If bracketing mementos are identical, should delta be zero?

• HTML is usually modified by the archive

• Can’t check for equality

• Shingling? SimHash?

[Timeline: –30d, 0, +30d]
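One way to picture the question on this slide: if the captures just before and just after the root are byte-identical, the resource presumably did not change in between, so its effective delta could be treated as zero. The sketch below illustrates that rule with a hash comparison. The function and type names are hypothetical, and exact hashing only suits content the archive leaves untouched (such as images); rewritten HTML would need a similarity measure instead.

```python
import hashlib
from datetime import datetime
from typing import Optional, Tuple

Memento = Tuple[datetime, bytes]  # (Memento-Datetime, archived content)


def effective_delta_days(root: datetime, before: Optional[Memento],
                         after: Optional[Memento]) -> Optional[float]:
    """Delta in days for one embedded resource, given its bracketing mementos."""
    if before is not None and after is not None:
        same = (hashlib.sha256(before[1]).digest()
                == hashlib.sha256(after[1]).digest())
        if same:
            return 0.0  # content unchanged across the root's datetime
    available = [m for m in (before, after) if m is not None]
    if not available:
        return None  # nothing archived for this resource
    # Fall back to the current rule: | root - embedded | for the nearest capture.
    return min(abs((m[0] - root).total_seconds()) for m in available) / 86_400


root = datetime(2005, 5, 14)
before = (datetime(2005, 4, 14), b"logo-bytes")  # 30 days before the root
after_same = (datetime(2005, 6, 13), b"logo-bytes")  # 30 days after, identical
after_diff = (datetime(2005, 6, 13), b"new-logo-bytes")

same = effective_delta_days(root, before, after_same)
changed = effective_delta_days(root, before, after_diff)
print(same, changed)
```

With identical bracketing captures the delta collapses to zero; otherwise it stays at the 30-day distance to the nearest capture.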


FUTURE WORK

Communicating Status


FUTURE WORK

Policies & Heuristics

• Current Spread Heuristics
  • Minimize distance
  • Past only
  • Past preferred
  • Near or within distance
  • Single vs. multi-archive

• Refine to meet user expectations
  • Speed (minimize time)
  • Accuracy (minimize temporal error)
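The "past only" and "past preferred" heuristics differ only in what happens when no capture exists at or before the root. The sketch below shows one plausible interpretation; the policy names, the `pick_memento` helper, and the exact tie-breaking are assumptions, since the slide lists the heuristics without defining them.

```python
from datetime import datetime
from typing import Optional, Sequence


def pick_memento(root: datetime, candidates: Sequence[datetime],
                 policy: str = "past_preferred") -> Optional[datetime]:
    """Choose an embedded memento relative to the root's Memento-Datetime."""
    if policy not in ("past_only", "past_preferred"):
        raise ValueError(f"unknown policy: {policy}")
    past = [m for m in candidates if m <= root]
    if past:
        return max(past)  # most recent capture at or before the root
    if policy == "past_only" or not candidates:
        return None  # leave the resource missing
    return min(candidates)  # past_preferred: earliest capture after the root


root = datetime(2005, 5, 14)
future_only = [datetime(2005, 5, 20), datetime(2005, 12, 21)]

strict = pick_memento(root, future_only, policy="past_only")
relaxed = pick_memento(root, future_only, policy="past_preferred")
print(strict, relaxed)
```

A user who needs evidence of what a page looked like at a given time might favor the strict policy, while a casual browser may prefer the relaxed one that fills more gaps.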


CONCLUSION

Extensive research on improving acquisition exists

Best use of existing collections needs study

We are looking at

• Characterizing existing holdings

• Characterizing temporal coherence

• Policies that minimize impact of temporal incoherence

• Visualizations of temporal coherence


MY QUESTIONS

Coherent

Violation

What do these mean to users? (1) (2) (3) (4)

What does this mean to users?