Profiling Web Archives (IIPC GA 2015)

Profiling Web Archives

Sawood Alam and Michael L. Nelson, Computer Science Department, Old Dominion University, Norfolk, Virginia 23529
Herbert Van de Sompel, Los Alamos National Laboratory, Los Alamos, NM
David S. H. Rosenthal, Stanford University Libraries, Stanford, CA

Upload: sawood-alam

Post on 16-Jul-2015


Memento Aggregator

Aggregates ~20 archives and counting
Only a few archives return good results for any query
Time, network, and resource wastage
Query routing can be helpful

Long Tail Matters

400B+ web pages at IA do not cover everything
Top three archives after IA produce full TimeMap 52% of the time
Targeted crawls
Special focus archives
Restricted resources
Private archives

"The Portuguese Web Archive and Memento unveil the first homepage of the Smithsonian Institution from May 1995... fb.me/3VAo6gEba"
PortugueseWebArchive (@PT_WebArchive), 1:12 PM - 5 Jan 2015

"Dennis Ritchie's Homepage has been deleted: cm.bell-labs.com/cm/cs/who/dmr/ - and the site has a robots.txt that blocks it from the Wayback."
Jason Scott (@textfiles), 2:37 PM - 22 Apr 2015

Memento Workflow

Memento Workflow with Profile

Available Profiling Resources

Client request
Archive response
Archive index (CDX files)

A Client Request

Canonical URL
Accept-Datetime (optional)
Accept-Language (optional)

    GET /timegate/http://www.cnn.com/ HTTP/1.1
    Host: mementoweb.org
    Accept: text/html,application/xhtml+xml;q=0.9,image/webp,*/*;q=0.8
    Accept-Encoding: gzip, deflate, sdch
    Accept-Datetime: Sat, 16 Jun 2012 00:00:00 GMT
    Accept-Language: en-US,en;q=0.8
    Cache-Control: max-age=0
    If-Modified-Since: Thu, 23 Apr 2015 16:51:50 GMT
    If-None-Match: "7ff8-5146718929580"
    Connection: keep-alive
    Cookie: __unam=34c3c7d-14ce917ce62-43c38e5e-7...
    User-Agent: Mozilla/5.0 Linux x86_64 Chrome/42.0.2311.90...
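As a rough illustration of the request above, the sketch below assembles the TimeGate URL and the two optional negotiation headers in Python. The function name `build_timegate_request` and its parameters are hypothetical, not part of any Memento client library; it only formats values and sends nothing over the network.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def build_timegate_request(timegate, uri_r, when=None, language=None):
    """Return (url, headers) for a TimeGate lookup of uri_r (hypothetical helper)."""
    headers = {}
    if when is not None:
        # Accept-Datetime carries an HTTP-date in GMT; `when` must be timezone-aware.
        headers["Accept-Datetime"] = format_datetime(
            when.astimezone(timezone.utc), usegmt=True
        )
    if language is not None:
        headers["Accept-Language"] = language
    # The original URI is appended to the TimeGate base, as in the request above.
    return timegate + uri_r, headers


url, headers = build_timegate_request(
    "http://mementoweb.org/timegate/",
    "http://www.cnn.com/",
    when=datetime(2012, 6, 16, tzinfo=timezone.utc),
    language="en-US,en;q=0.8",
)
```

Both headers are optional, so a bare call with only the TimeGate and the URI-R yields an empty header set.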

An Archive Response

Canonical URL (known)
Memento-Datetime
Original Content-Language (optional)

    HTTP/1.1 200 OK
    Server: Tengine/2.0.3
    Date: Sun, 26 Apr 2015 00:25:57 GMT
    Content-Type: text/html;charset=utf-8
    Content-Length: 85945
    Connection: keep-alive
    set-cookie: wayback_server=37; Domain=archive.org; Path=/; Expires=Tue, 26-May-15 00:25:57 GMT;
    Memento-Datetime: Sat, 25 Apr 2015 13:38:16 GMT
    Link: ; rel="original", ; rel="timemap"; type="application/link-format",
    X-Archive-Guessed-Charset: UTF-8
    X-Archive-Orig-via: 1.1 varnish, 1.1 varnish, 1.1 varnish
    X-Archive-Orig-content-language: en
    X-Archive-Orig-x-content-type-options: nosniff
    X-Archive-Orig-vary: Accept-Encoding,Cookie
    X-Archive-Orig-content-type: text/html; charset=UTF-8
    X-Archive-Orig-cache-control: private, s-maxage=0, max-age=0, must-revalidate
    X-Archive-Orig-server: Apache

A CDX Snippet

Canonical URL
Memento Datetime

    cnn.com/ 20080226193757 http://www.cnn.com/ text/html 200 2Q4OZSVKPZMUF36UN6FBXFNGDKARPA7N - - 10368892 ARCHIVEIT-1022-MOLLYASTRID-CASTRORESI-20080226193719-00000-crawling10.us.archive.org.arc.gz
    cnn.com/ 20090314024036 http://www.cnn.com/ text/html 200 4PVCGT22VVTDJ3GXIJEUZ3JOJ4HBZYB4 - - 13282500 ARCHIVEIT-1023-20090314024015-00058-crawling105.us.archive.org.warc.gz
    cnn.com/ 20090314024036 http://www.cnn.com/ text/html 200 4PVCGT22VVTDJ3GXIJEUZ3JOJ4HBZYB4 - - 24595360 ARCHIVEIT-1023-20090314024003-00057-crawling105.us.archive.org.arc.gz
    i.cdn.travel.cnn.com/ 20130102083554 http://i.cdn.travel.cnn.com/ text/html 200 D2MJUR62VJ5D6CNL5PUDQFEW4GIRGIX2 - - 33901711 ARCHIVEIT-1023-QUARTERLY-UGGVZU-20130102080252-00007-wbgrp-crawl063.us.archive.org-6682.warc.gz
    i.cdn.travel.cnn.com/ 20130404172913 http://i.cdn.travel.cnn.com/ text/html 200 JZKL7HGGBN73BUXFISEJLM7YNAXE7MTI - - 274508081 ARCHIVEIT-1023-QUARTERLY-5885-20130404074716948-00002-wbgrp-crawl067.us.archive.org-6443.warc.gz
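The two fields a profiler needs from each CDX line, the canonical URL key and the memento datetime, are the first two whitespace-separated columns. A minimal sketch of pulling them out, assuming whitespace-delimited fields and a 14-digit timestamp; `parse_cdx_line` is a hypothetical helper name:

```python
from datetime import datetime, timezone


def parse_cdx_line(line):
    """Return (url_key, capture_datetime) from one CDX line (hypothetical helper)."""
    fields = line.split()
    url_key, timestamp = fields[0], fields[1]
    # CDX timestamps are YYYYMMDDhhmmss; treated here as UTC.
    captured = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return url_key, captured


sample = (
    "cnn.com/ 20080226193757 http://www.cnn.com/ text/html 200 "
    "2Q4OZSVKPZMUF36UN6FBXFNGDKARPA7N - - 10368892 "
    "ARCHIVEIT-1022-MOLLYASTRID-CASTRORESI-20080226193719-00000-crawling10.us.archive.org.arc.gz"
)
url_key, captured = parse_cdx_line(sample)
```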

Complete URI-R Profiling

Sanderson et al. created a URI-R profile for various archives
Extracted every URI-R from all the CDX files
Gained complete knowledge of the holdings of the participating archives
Profiles were huge
Difficult to keep up-to-date
Misses URI-Rs added later in the archive

TLD-only Profiling

AlSum et al. created a TLD profile for various archives
Collected statistics about various archives on various TLDs
Lightweight profiles
Lots of false positives
All the ".com" queries will be routed to an archive that has only a few URI-Rs with ".com" TLD

Middle Ground

Partial URI-Rs, such as:
  Registered domain name
  Complete domain name (along with any sub-domains)
  Complete domain name and first few path segments
Registered domain name and counts of other segments such as sub-domain, path, and query parameter
Combining above with other attributes such as Content-Language and Memento-Datetime

Archive Profile

High-level digest of an archive
Predicts presence of mementos of a URI-R in an archive
Provides various statistics about the holdings
Small in size
Publicly available
Easy to update and partially patch
Useful for Memento query routing and other things

Structure

    Archive metadata
    Statistics:
      Profile types:
        Keys:
          Frequency measurements

Profile types

URI-R based
  Complete URI-R
  TLD only
  URI-R hashes, such as:
    Only first few segments of the URI-R (Sub-URI)
    Registered domain name along with counts of other segments (Segment-Digest)
Language
Datetime
Many more...

Keys

Depend on the profile type
Control the balance between profile size and details

    URI-R: "bbc.co.uk/images/Logo.png?height=80&width=200"
    TLD: ".uk"
    Sub-URI: "uk)/", "uk,co)/", "uk,co,bbc)/", "uk,co,bbc)/images"
    Seg-Digest: "0/bbc.co.uk/4"
    Language: "en-GB"
    Datetime: "201403" # YYYYMM

Frequency Measurements

Can have the same structure for all profile types
Flexible to choose the attribute set to be included
Affects the profile complexity
Predicts the presence of the mementos of a URI-R

    "uk,co,bbc)/":
      urim:
        max: 2
        min: 1
        total: 128
      urir: 115

Horizontal and Vertical Holdings

    "uk,co,bbc)/":
      urim:
        max: 100
        min: 100
        total: 100
      urir: 1

    "uk,co,bbc)/":
      urim:
        max: 1
        min: 1
        total: 100
      urir: 100

    "uk,co,bbc)/":
      urim:
        max: 20
        min: 5
        total: 100
      urir: 10

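To make the three cases above concrete, here is a minimal sketch of how such frequency measurements could be derived for one key, assuming the input is a mapping from each URI-R under that key to its number of mementos. The function name and input shape are illustrative assumptions, not taken from the profiler scripts.

```python
def frequency_measurements(memento_counts):
    """Summarize per-URI-R memento counts for a single profile key.

    memento_counts: dict mapping URI-R -> number of URI-Ms (assumed input shape).
    """
    counts = list(memento_counts.values())
    return {
        "urim": {"max": max(counts), "min": min(counts), "total": sum(counts)},
        "urir": len(counts),
    }


# One page captured 100 times: all 100 mementos belong to a single URI-R.
deep = frequency_measurements({"bbc.co.uk/": 100})

# 100 different pages captured once each: 100 mementos spread over 100 URI-Rs.
wide = frequency_measurements({f"bbc.co.uk/page{i}": 1 for i in range(100)})
```

Both cases have the same `total`, so it is the combination of `urir`, `max`, and `min` that distinguishes the shape of the holdings.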
Sample Profile

    ---
    "@context": "https://oduwsdl.github.io/context/archprofile.jsonld"
    "@id": "http://www.webarchive.org.uk/ukwa/"
    about:
      accesspoint: "http://www.webarchive.org.uk/wayback/"
      mechanism: "http://oduwsdl.github.io/terms/mechanism#cdx"
      name: "UKWA 1996 Collection"
      profile_updated: "2015-01-20T17:25:30Z"
      suburi_class: "http://oduwsdl.github.io/terms/suburi#H3P1"
      more_meta_data: "..."
    stats:
      language:
        "en-US":
          urim: {max: 13, min: 1, total: 47529}
          urir: 25621
        "more_languages": "..."
      suburi:
        "uk)/":
          urim: {max: 8, min: 1, total: 932432}
          urir: 867817
        "uk,co)/":
          urim: {max: 8, min: 1, total: 410979}
          urir: 378686

URI-R Based Profiles

URI-R preprocessing
  Canonicalize
  Apply SURT
  Split segments
  Extract registered domain
  Count segments (sub-domain, path, query params)
Generate all Sub-URIs
  Incrementally add segments from left to right
  Only up to max host and path segments config
Create Segment-Digest with registered domain
  Prefix sub-domain count
  Suffix path and query params count

Key Generation

    https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f

Intermediate Values

    {
      canonical_url: "bbc.co.uk/images/Logo.png?height=80&width=200",
      surt_url: "uk,co,bbc)/images/Logo.png?height=80&width=200",
      reg_domain: "bbc.co.uk",
      path_initial: "i",
      subdomain_count: 1,
      path_count: 2,
      query_params_count: 2
    }

Sub-URI (H3P1)

    ["uk)/", "uk,co)/", "uk,co,bbc)/", "uk,co,bbc)/images"]

SegDigest (include_path_initial)

    "1/bbc.co.uk/i4"

Implementation

GitHub: /oduwsdl/suburi_generator
A Python module to generate Sub-URIs from SURT

GitHub: /oduwsdl/archive_profiler
Various profile generation scripts

Canonicalization

Remove "http(s)", "www", and fragment of a URI
Downcase hostname
Remove some known query params, e.g., "jsessionid"
Sort query params by keys and values (secondary)

    URL = "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"
    Canonicalize(URL)
    # => "bbc.co.uk/images/Logo.png?height=80&width=200"
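The canonicalization rules above can be sketched with the Python standard library. This is an approximation under stated assumptions, not the code from the profiler repository: `DROPPED_PARAMS` is a made-up list holding only the one parameter the slide names, ports are discarded, and query values are re-encoded by `urlencode`, which may differ from the original escaping.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumed list of query parameters to discard; the slide names only "jsessionid".
DROPPED_PARAMS = {"jsessionid"}


def canonicalize(url):
    """Drop scheme, leading "www.", and fragment; lowercase host; sort query params."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    if host.startswith("www."):
        host = host[len("www."):]
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in DROPPED_PARAMS
    ]
    params.sort()  # by key first, then by value
    query = urlencode(params)
    return host + (parts.path or "/") + ("?" + query if query else "")


canonical = canonicalize("https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f")
```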

Sort-friendly URI Reordering Transform (SURT)

Take canonical URL as input
Join hostname segments by commas in reverse order
Separate hostname and path by closing parenthesis

    Can_URL = "bbc.co.uk/images/Logo.png?height=80&width=200"
    SURT(Can_URL)
    # => "uk,co,bbc)/images/Logo.png?height=80&width=200"
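A minimal sketch of the SURT step as described above, assuming the input is already canonical (no scheme, host followed by an optional path). The function name `surt` is illustrative and this is not the implementation of any particular SURT library.

```python
def surt(canonical_url):
    """Reverse the hostname segments and separate host from path with ")/"."""
    host, _, rest = canonical_url.partition("/")
    reversed_host = ",".join(reversed(host.split(".")))
    return reversed_host + ")/" + rest


surt_url = surt("bbc.co.uk/images/Logo.png?height=80&width=200")
```

Reversing the host puts the most general segment first, so sorting SURT keys groups all URLs of a domain and its sub-domains together.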

Sub-URI

Take SURT URL as input
Incrementally add segments from left to right, one by one
Stop when the hostname or path segment limit of the policy is reached
Return the list of all Sub-URIs

    SURT_URL = "uk,co,bbc)/images/Logo.png?height=80&width=200"
    SubURI(SURT_URL, policy="H3P1")
    # => ["uk)/",
    #     "uk,co)/",
    #     "uk,co,bbc)/",
    #     "uk,co,bbc)/images"]
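The Sub-URI generation above might look roughly like the following. This is a sketch, not the code of the suburi_generator module. It assumes a policy string of the form `H<n>P<m>`, where `x` means "no limit" (as in the HxP1 and HxPx policy names), that query parameters never contribute segments, and that path segments are only added when the hostname was not truncated; those last two points are assumptions of this sketch.

```python
import re


def sub_uris(surt_url, policy="H3P1"):
    """List Sub-URI keys for a SURT URL under an H<n>P<m> policy (illustrative)."""
    match = re.fullmatch(r"H(\d+|x)P(\d+|x)", policy)
    if match is None:
        raise ValueError(f"unrecognized policy: {policy!r}")
    # "x" means unlimited, which slicing expresses as None.
    max_host, max_path = (None if g == "x" else int(g) for g in match.groups())

    host, _, rest = surt_url.partition(")/")
    host_segments = host.split(",")
    path_segments = [s for s in rest.split("?", 1)[0].split("/") if s]

    kept_hosts = host_segments[:max_host]
    keys = [",".join(kept_hosts[:i]) + ")/" for i in range(1, len(kept_hosts) + 1)]

    # Assumption: only extend into the path when the full hostname was kept.
    if len(kept_hosts) == len(host_segments):
        prefix = keys[-1]
        for j in range(1, len(path_segments[:max_path]) + 1):
            keys.append(prefix + "/".join(path_segments[:j]))
    return keys


keys = sub_uris("uk,co,bbc)/images/Logo.png?height=80&width=200", policy="H3P1")
```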

URL to Sub-URI

    URL = "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"

    Can_URL = Canonicalize(URL)
    # => "bbc.co.uk/images/Logo.png?height=80&width=200"

    SURT_URL = SURT(Can_URL)
    # => "uk,co,bbc)/images/Logo.png?height=80&width=200"

    Sub_URIs = SubURI(SURT_URL, policy="H3P1")
    # => ["uk)/",
    #     "uk,co)/",
    #     "uk,co,bbc)/",
    #     "uk,co,bbc)/images"]

Segment Count Digest

Extract registered domain name and initial letter of path
Count sub-domain and trailing (path + query) segments
Serialize as follows:

    {subdomain_count}/{reg_domain}/{path_initial}?{trailing_count}

    URL = "https://www.BBC.co.uk/images/Logo.png?width=200&height=80#f"
    Segments = Segmentize(URL)
    # => {reg_domain: "bbc.co.uk",
    #     path_initial: "i",
    #     subdomain_count: 1,
    #     path_count: 2,
    #     query_params_count: 2,
    #     trailing_count: 4}
    SegDigest(Segments, policy="exclude_path_initial")
    # => "1/bbc.co.uk/4"
    SegDigest(Segments, policy="include_path_initial")
    # => "1/bbc.co.uk/i4"
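The serialization step can be sketched as below, starting from an already segmented URL. Extracting the registered domain generally requires a public suffix list, so it is taken as given here. The function name `seg_digest` and the dictionary layout mirror the slide's example but are otherwise assumptions.

```python
def seg_digest(segments, policy="exclude_path_initial"):
    """Serialize segment counts as {subdomains}/{reg_domain}/{initial?}{trailing}."""
    trailing = segments["path_count"] + segments["query_params_count"]
    initial = segments["path_initial"] if policy == "include_path_initial" else ""
    return f'{segments["subdomain_count"]}/{segments["reg_domain"]}/{initial}{trailing}'


segments = {
    "reg_domain": "bbc.co.uk",
    "path_initial": "i",
    "subdomain_count": 1,
    "path_count": 2,
    "query_params_count": 2,
}
plain = seg_digest(segments)
with_initial = seg_digest(segments, policy="include_path_initial")
```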

JSON Serialization

Can have complex nested data structure
JSON-LD for linked data
No partial key lookup
Unsuitable for text processing tools
Allows processing only when fully loaded
A single malformed character makes it unparsable
Difficult to patch

    {
      "suburi": {
        "uk)/": {
          "urim": {"max": 8, "min": 1, "total": 932432},
          "urir": 867817
        },
        "uk,co)/": {
          "urim": {"max": 8, "min": 1, "total": 410979},
          "urir": 378686
        },
        "uk,co,bbc)/": {
          "urim": {"max": 2, "min": 1, "total": 128

CDX-JSON Serialization

Fusion of CDX and JSON file formats
A key followed by strict single-line JSON value
Unlike CDX, values can have arbitrary attributes
Text processing tool friendly
No single root node or single document restrictions
Enables binary search
Enables partial key lookup
Error resilient

    @context "https://oduwsdl.github.io/contexts/archiveprofile.jsonld"
    @id "http://www.webarchive.org.uk/ukwa/"
    @about {"name": "UKWA 1996 Collection", "type": "suburi#H3P1", "...":
    uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817},
    uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686
    uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115},
    uk,co,bbc)/images {"urim": {"max": 1, "min": 1, "total": 3}, "urir":
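The binary-search and error-resilience points above can be illustrated with a short sketch. It assumes each data line is a key, one space, and a complete single-line JSON value, and that lines beginning with `@` are metadata. The sample lines are adapted from the slide, with one deliberately broken line added, and the function names are hypothetical.

```python
import bisect
import json


def load_profile(lines):
    """Parse key/JSON lines into parallel sorted lists, skipping malformed lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("@"):
            continue  # metadata lines are ignored in this sketch
        key, _, value = line.partition(" ")
        try:
            records.append((key, json.loads(value)))
        except json.JSONDecodeError:
            continue  # one bad line does not invalidate the rest of the file
    records.sort(key=lambda record: record[0])
    return [k for k, _ in records], [v for _, v in records]


def lookup(keys, values, key):
    """Binary-search the sorted keys; return the stored value or None."""
    index = bisect.bisect_left(keys, key)
    if index < len(keys) and keys[index] == key:
        return values[index]
    return None


sample_lines = [
    '@id "http://www.webarchive.org.uk/ukwa/"',
    'uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817}',
    'uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686}',
    'uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115}',
    'uk,co,bbc)/images {"urim": {"max": 1, ',  # deliberately truncated line
]
keys, values = load_profile(sample_lines)
bbc = lookup(keys, values, "uk,co,bbc)/")
```

Because every record is self-contained on its own line, the same file also works with line-oriented tools such as sort and grep.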

Merging

Only process new data to periodically update for freshness
Parallel processing
Difficult to keep detailed measures with absolute values
Derived simple heuristic measures to predict presence of mementos

Merging Example

Base Profile

    com,cnn)/ {"urir_sum": 30, "sources": 1},
    uk,co,bbc)/ {"urir_sum": 20, "sources": 1}

New Profile

    com,cnn)/ {"urir_sum": 10, "sources": 1},
    com,usatoday)/ {"urir_sum": 5, "sources": 1}

Merged Profile

    com,cnn)/ {"urir_sum": 40, "sources": 2},
    uk,co,bbc)/ {"urir_sum": 20, "sources": 1},
    com,usatoday)/ {"urir_sum": 5, "sources": 1}
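The merge shown in the example amounts to summing both fields for keys that appear in both profiles and copying the rest. A minimal sketch, assuming each profile is held in memory as a dictionary keyed by Sub-URI; the function name `merge_profiles` is illustrative.

```python
def merge_profiles(base, new):
    """Combine two profiles by summing urir_sum and sources per key."""
    merged = {key: dict(value) for key, value in base.items()}  # copy, do not mutate
    for key, value in new.items():
        if key in merged:
            merged[key]["urir_sum"] += value["urir_sum"]
            merged[key]["sources"] += value["sources"]
        else:
            merged[key] = dict(value)
    return merged


base = {
    "com,cnn)/": {"urir_sum": 30, "sources": 1},
    "uk,co,bbc)/": {"urir_sum": 20, "sources": 1},
}
new = {
    "com,cnn)/": {"urir_sum": 10, "sources": 1},
    "com,usatoday)/": {"urir_sum": 5, "sources": 1},
}
merged = merge_profiles(base, new)
```

Since addition is associative and commutative, partial profiles built in parallel can be merged in any order with the same result.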

Dataset

Two archives
Three sample query sets
Various profiles for each archive and sample set

Archives

    Archive      URI-Rs   URI-Ms   Size
    Archive-It   1.9B     5.3B     1.8TB
    UKWA         0.7B     1.7B     0.5TB

Sample Query Sets

    Sample         Size      In Archive-It   In UKWA
    DMOZ           100,000   4,042           1,896
    MementoProxy   100,000   4,222           193
    IAWayback      100,000   3,999           275

Evaluation

Relate CDX Size, URI-M, URI-R, and Sub-URI
Analyze profile growth
Estimate Relative Cost
Evaluate Routing Precision vs. Relative Cost

$$\text{Relative Cost} = \frac{|\text{Keys in the Profile}|}{|\text{URI-R in the Archive}|}$$

$$\text{Routing Precision} = \frac{|\text{URI-R Present in the Archive}|}{|\text{URI-R Predicted by the Profile in Archive}|}$$
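Both measures are simple ratios of set sizes. The sketch below computes them from plain Python collections; the function names and the toy numbers are illustrative assumptions and do not come from the evaluation data.

```python
def relative_cost(profile_keys, archive_urirs):
    """|keys in the profile| / |URI-Rs in the archive|."""
    return len(profile_keys) / len(archive_urirs)


def routing_precision(present_urirs, predicted_urirs):
    """|query URI-Rs present in the archive| / |query URI-Rs the profile predicted|."""
    if not predicted_urirs:
        return 0.0  # nothing was routed to the archive
    return len(present_urirs) / len(predicted_urirs)


# Toy numbers: a 3-key profile summarizing 100 URI-Rs, and 50 routed queries
# of which 11 were actually held by the archive.
cost = relative_cost({"uk)/", "uk,co)/", "uk,co,bbc)/"}, range(100))
precision = routing_precision(range(11), range(50))
```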

UKWA Dataset

Yearly data as separate collections
Average CDX line size: 275 bytes
URI-M/URI-R ratio: 2.46

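Accumulated URI-R Growth (UKWA)

Successive yearly data was merged
Follows Heaps' Law

$$C_r = K \, C_m^{\beta}, \qquad K = 3.897, \quad \beta = 0.892$$

where $C_r$ is the number of URI-Rs and $C_m$ is the number of URI-Ms.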
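As a quick sanity check of the fit, the sketch below evaluates the Heaps' Law relation for a given number of URI-Ms, reading the garbled formula as URI-R count equals K times URI-M count raised to the power β. The function name is hypothetical; the constants are the fitted values from the slide.

```python
def estimated_urir_count(urim_count, k=3.897, beta=0.892):
    """Heaps' Law estimate of unique URI-Rs for a given number of URI-Ms."""
    return k * urim_count ** beta


# UKWA is listed with about 1.7B URI-Ms in the Archives table.
estimate = estimated_urir_count(1.7e9)
```

For 1.7B URI-Ms the estimate comes out at roughly two thirds of a billion URI-Rs, in line with the 0.7B reported for UKWA.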

Sub-URI Key Growth (UKWA)

Slope of the fit line is the Relative Cost for the profile policy
Complete URI-R profile has Relative Cost 1

Cost Analysis

Search Precision of Various Profiles

Search Precision wrt TLD-only profile

Double for H3P0
Five-fold for HxP1
Segment-Digest is as good as H3P0

Relative Cost vs. Search Precision

Up to 22% routing precision with <5% Relative Cost
<0.3% of the sample URIs from the MementoProxy and IAWayback logs are present in UKWA
Shallow crawling of UKWA results in higher cost

Relative Profile Cost (UKWA)

    Profile  Cost      Profile  Cost      Profile  Cost
    H1P0     3.2e-06   H3P2     0.26823   HxP2     0.38313
    H2P0     0.00027   H3P3     0.37343   HxP3     0.53928
    H2P1     0.00059   H4P0     0.01348   HxP4     0.63889
    H2P2     0.00099   H5P0     0.01388   HxP5     0.71568
    H3P0     0.00862   HxP0     0.01401   HxPx     0.83107
    H3P1     0.11864   HxP1     0.16349   URIR     1.00000

Future Work

Generating sample URI sets
Profiling via sampling
Language profiles
Evaluation of combination profiles such as Sub-URI along with Datetime
Profiles for usage other than Memento routing, such as:
  Media-type profiles (e.g., images, pdf, audio, etc.)
  Site classification based profiles (e.g., news, wiki, social media, blog, etc.)

Conclusions

Generated profiles with different policies for two archives
Examined cost-accuracy trade-offs of various profiles
Related CDX Size, URI-M, URI-R, and Sub-URI
Gained up to 22% routing precision with <5% relative cost without any false negatives
<5% of the queried URIs are present in each of the individual archives
Implementation code is available at:
  GitHub: /oduwsdl/suburi_generator
  GitHub: /oduwsdl/archive_profiler