web archiving challenges and opportunities

WEB ARCHIVING CHALLENGES & OPPORTUNITIES PRESENTATION FOR WEB ARCHIVING ENGINEERING POSITION Ahmed AlSum PhD Candidate Old Dominion University

Upload: ahmed-alsum

Post on 28-Jan-2015


DESCRIPTION

This is my presentation for a job interview for the web archiving engineer position at Stanford University Libraries on Oct 25.

TRANSCRIPT

Page 1: Web archiving challenges and opportunities

WEB ARCHIVING CHALLENGES & OPPORTUNITIES
PRESENTATION FOR WEB ARCHIVING ENGINEERING POSITION

Ahmed AlSum
PhD Candidate

Old Dominion University

Page 2: Web archiving challenges and opportunities

Outline
• Engineering Experience
  • IBM
  • Old Dominion University
  • Internet Archive
• Web Archiving Challenges & Opportunities
  • Selection
  • Harvesting
  • Storage
  • Access
  • Community
• Conclusions

Page 3: Web archiving challenges and opportunities

Cairo, Egypt
2006 - 2009

Page 4: Web archiving challenges and opportunities

CCSP Project
• An internal IBM support portal that provides client-facing audiences a by-client, holistic view of client situations
• Technologies: WebSphere Portal, DB2, deployed on zLinux machines

Page 5: Web archiving challenges and opportunities

Responsibilities
• Software Engineer
  • Enterprise applications on J2EE platform technologies: Servlets, JSP, and Portlet APIs for the front end, and EJB for back-end tasks
  • Front-end components based on Web 2.0 technologies (AJAX with Dojo 1.0, and JavaScript)
  • Lotus Sametime (plugin and bot development)
• Software engineering team leader
  • Support project quality activities
  • Lead code review and static analysis activities

Page 6: Web archiving challenges and opportunities

Responsibilities
• Administrator
  • Deploying Portal solutions on WebSphere Portal
  • WebSphere Portal administration for standalone and clustered environments
  • Administration on Linux and Windows OS
  • DB2 server administration for single and multiple instances with HADR support
• Customer support team lead
  • Leading customer support activities

Page 7: Web archiving challenges and opportunities

Certifications

Page 8: Web archiving challenges and opportunities

Sharing IBM Internal Solutions with Broader Community

Page 9: Web archiving challenges and opportunities

Norfolk, VA, USA
2009 - 2013

Page 10: Web archiving challenges and opportunities

Memento
• Memento is an HTTP extension to integrate the Past and the Current Web

I. Jacobs and N. Walsh. Architecture of the World Wide Web. Technical report, W3C, 2004. http://www.w3.org/TR/webarch/

[Figure: timeline marking T1, T2, T3, and Now]
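
As a rough illustration of the HTTP extension described on this slide: a Memento client asks for the version of a URI closest to a desired date by sending an Accept-Datetime request header, and the archive labels its answer with a Memento-Datetime response header. The sketch below only builds and parses those header values. The helper names are made up for this example, and no network request is made.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_headers(when):
    """Build request headers asking for the memento closest to `when`."""
    # Memento datetimes are HTTP dates, always expressed in GMT.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


def memento_datetime(response_headers):
    """Read the archival datetime from response headers, or None if absent."""
    value = response_headers.get("Memento-Datetime")
    return parsedate_to_datetime(value) if value else None


# Ask for the state of a page around the 2010 Winter Olympics opening day.
headers = accept_datetime_headers(datetime(2010, 2, 12, tzinfo=timezone.utc))
```

A real client would send these headers to a TimeGate and follow the redirect to the selected memento.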

Page 11: Web archiving challenges and opportunities

Memento

• Developer and administrator for Memento aggregator and proxies

Page 12: Web archiving challenges and opportunities

Memento Clients

• Memento is currently an Internet-Draft (I-D) and is expected to advance to an RFC soon.

Page 13: Web archiving challenges and opportunities

San Francisco, CA, USA
2012

Page 14: Web archiving challenges and opportunities

WAT Extraction
• Web Archive Transformation (WAT) is a specification for structuring metadata generated by Web crawls
• Technologies:
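
To give a feel for what "structuring metadata generated by Web crawls" can mean, here is a minimal sketch that derives a small JSON-ready record (title and outgoing links) from one captured HTML page. The record keys and function names are hypothetical and do not reproduce the actual WAT field names.

```python
import json
from html.parser import HTMLParser


class HeadAndLinks(HTMLParser):
    """Collect the page title and outgoing link targets from one HTML payload."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def wat_like_record(target_uri, capture_date, html):
    """Return a metadata dict for one capture (illustrative keys only)."""
    parser = HeadAndLinks()
    parser.feed(html)
    return {
        "target_uri": target_uri,
        "capture_date": capture_date,
        "html_metadata": {"title": parser.title.strip(), "links": parser.links},
    }


record = wat_like_record(
    "http://example.com/",
    "2012-07-01T00:00:00Z",
    "<html><head><title>Example page</title></head>"
    "<body><a href='/a'>A</a><a>no target</a></body></html>",
)
print(json.dumps(record, indent=2))
```

Keeping such records next to the raw captures lets later analysis read titles and links without re-parsing every page.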

Page 15: Web archiving challenges and opportunities

WEB ARCHIVING

Challenges and Opportunities

Page 16: Web archiving challenges and opportunities

Web Archive Life Cycle

Hockx-Yu, H. 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8.

Page 17: Web archiving challenges and opportunities

Selection
• Decide what to capture
  • Everything, any domain
  • National domains
  • Delegate selection to partners
  • Users’ favorites
• We studied what is already captured

Page 18: Web archiving challenges and opportunities

How Much Of The Web Is Archived?

S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson

In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11, Ottawa, Canada 2011

See also: http://arxiv.org/abs/1212.6177

Page 19: Web archiving challenges and opportunities

Archive categories

We have 3 categories of archives
• Internet Archive (classic interface)
• Search engine
• Other archives

Selection

UK

US

Public Archives, ca. Late 2010 / Early 2011

Page 20: Web archiving challenges and opportunities

1000 URIs Ordered by First Observation Date

Selection

See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html

Page 21: Web archiving challenges and opportunities

Memento Distribution, ordered by the first observation date

Page 22: Web archiving challenges and opportunities

How Much of the Web is Archived?
It Depends on Which Web…

Selection

Including SE cache    Excluding SE cache
90%                   79%
97%                   68%
88%                   19%
35%                   16%

Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives

2013: 95%, 92%, 23%, 26%

Page 23: Web archiving challenges and opportunities

Profiling Web Archive Coverage For Top-level Domain And Content Language

A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel

In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013

See also: http://arxiv.org/abs/1309.4008

Page 24: Web archiving challenges and opportunities

Where is it archived?

Selection

IA: Internet Archive
CAN: Library and Archives Canada
PO: Portuguese Web Archive
CZ: Archive of the Czech Web
LoC: Library of Congress
BL: British Library
CAT: Web Archive of Catalonia
TW: National Taiwan University
IC: Icelandic Web Archive
UK: UK National Library
CR: Croatian Web Archive
AIT: Archive-It

Page 25: Web archiving challenges and opportunities

Language Coverage

Selection


Page 26: Web archiving challenges and opportunities

Growth Rate

Selection


Borrowed Portuguese material from IA

Stopped archiving since 2008

Steady growth

Stopped getting new URIs, but still crawling

Page 27: Web archiving challenges and opportunities

Selection Research Output
• Some portions of the Web, such as India and Africa, are not well archived.
• Profiling helps us with Memento query routing.
• IIPC proposal with Herbert Van de Sompel (LANL) and David Rosenthal (SUL).
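
One way to picture how archive profiles can support Memento query routing is the sketch below: given a per-archive profile of holdings by top-level domain, it picks the archives most likely to hold a requested URI. The archive codes follow the legend used in these slides, but the profile numbers and function names are invented for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical profiles: share of each archive's holdings per top-level domain.
PROFILES = {
    "IA": {"com": 0.60, "org": 0.15, "uk": 0.02, "pt": 0.01},
    "BL": {"uk": 0.95, "com": 0.03},
    "PO": {"pt": 0.90, "com": 0.05},
}


def top_level_domain(uri):
    """Return the last label of the host name, lowercased."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def route(uri, profiles=PROFILES, k=2):
    """Return up to k archive codes most likely to hold `uri`, best first."""
    tld = top_level_domain(uri)
    scored = [(profile.get(tld, 0.0), name) for name, profile in profiles.items()]
    ranked = sorted((item for item in scored if item[0] > 0), reverse=True)
    return [name for _, name in ranked[:k]]
```

An aggregator could query only the returned archives and fall back to asking every archive when the list is empty.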

Selection

Page 28: Web archiving challenges and opportunities

Selection at SUL
• Focus on the missing parts of the Web
• Twitter - Crowdsource:
  • UK Web Archive: Twittervane
  • Internet Memory: Collect URIs from Twitter APIs
  • VA Tech: CTRNET project
• Stanford Community
• World News collection: 10 news websites from each country
• Tools:

Selection

Page 29: Web archiving challenges and opportunities

Web Archive Life Cycle

Hockx-Yu, H. 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8.

Page 30: Web archiving challenges and opportunities

Harvesting
• Services
  • Archive-It
  • WAS @ CDLib
• Dedicated servers
• New tools

See also: http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html

Page 31: Web archiving challenges and opportunities

Special Harvesting Techniques
• Borrow old materials from other web archives
• e.g., Stanford WebBase Project*
  • 260 TB
  • 7 billion webpages

Harvesting

*http://www-diglib.stanford.edu/~testbed/doc2/WebBase/

Page 32: Web archiving challenges and opportunities

Special Harvesting Techniques
• Social Media
  • Focus on resources shared on social media

Harvesting

Hany M. SalahEldeen and Michael L. Nelson, Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?, Proceedings of TPDL 2012
http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html

Page 33: Web archiving challenges and opportunities

Special Harvesting Techniques
• SiteStory - Transactional Archive

Harvesting

Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, and Herbert Van de Sompel, Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool, Proceedings of TPDL 2013
SiteStory: http://mementoweb.github.io/SiteStory/

Page 34: Web archiving challenges and opportunities

Harvesting
• Challenges
  • Ajax and Web 2.0/3.0
  • Streaming media
  • URI challenges
  • Mobile

Harvesting

http://blog.dshr.org/2012/05/harvesting-and-preserving-future-web.html
http://netpreserve.org/sites/default/files/resources/OverviewFutureWebWorkshop.pdf

Page 35: Web archiving challenges and opportunities

Web Archive Life Cycle

Hockx-Yu, H. 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8.

Page 36: Web archiving challenges and opportunities

Storage (Format)
• Flat files:
  • WARC files (ISO standard)
• NoSQL database:
  • HBase at Internet Memory*
• Storage at SUL:
  • We need to use both

Storage

*Philippe Rigaux, Understanding HBase: The data model, IM technology blog, http://internetmemory.org/en/index.php/synapse/understanding_the_hbase_data_model/
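
For readers unfamiliar with the flat-file option, the sketch below serializes a single uncompressed WARC/1.0 record: a version line, named header fields, a blank line, the content block, and two trailing CRLFs. The function name and defaults are assumptions for illustration. Production crawlers write many such records per file, usually gzip-compressed, and use richer record types.

```python
import uuid
from datetime import datetime, timezone


def warc_record(target_uri, payload, record_type="resource",
                content_type="text/html", when=None):
    """Serialize one WARC/1.0 record as bytes (uncompressed)."""
    when = when or datetime.now(timezone.utc)
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    # A blank line ends the header; two CRLFs end the record.
    return head.encode("utf-8") + b"\r\n" + payload + b"\r\n\r\n"


example = warc_record(
    "http://example.com/",
    b"<html></html>",
    when=datetime(2013, 10, 25, tzinfo=timezone.utc),
)
```

Because every record carries its own length, a reader can skip from record to record without parsing the payloads, which is what makes simple offset indexes over WARC files possible.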

Page 37: Web archiving challenges and opportunities

Storage (Infrastructure)
• The wrong solution could be a disaster

Storage

Page 38: Web archiving challenges and opportunities

Web Archive Life Cycle

Hockx-Yu, H. 2011. The Past Issue of the Web. In Proceedings of the 3rd International Conference on Web Science, pp. 1–8.

Page 39: Web archiving challenges and opportunities

Accessing Web Archive

URI-Based

Wayback Machine

• Textbox to enter the requested URI

• BubbleMap to show you the available mementos

Page 40: Web archiving challenges and opportunities

Accessing Web Archive

Full-text search

• Challenges: Temporal Page Rank, Rank per site or memento, Date filtering

Page 41: Web archiving challenges and opportunities

Accessing Web Archive
• Thumbnail View
  • Trade-off between building thumbnails in real time and pre-building them
  • Trade-off between representing the thumbnail by URI and by embedded binary data
  • Can we build a partial thumbnail map?

Page 42: Web archiving challenges and opportunities

Accessing Web Archive
• Title View
  • Trade-off between extracting all titles ahead of time and keeping them as metadata about each memento, and extracting the title from the HTML content in real time

Implemented using Simile: http://www.simile-widgets.org/timeline/

Page 43: Web archiving challenges and opportunities

Accessing Web Archive
• Wayback Machine API
  • XML interface for the list of available Mementos

Page 44: Web archiving challenges and opportunities

Accessing Web Archive
• Web Page Snapshot Replay
  • URI rewriting, JavaScript, and embedded resources
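
The URI rewriting step of replay can be sketched as follows: every quoted href or src value in an archived page is resolved against the page's original URI and pointed at an archive-style path of the form prefix/timestamp/original-URI. The function name and the "/web" prefix are assumptions for this example.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src='...' with a quoted value.
ATTR = re.compile(r"""\b(href|src)=(["'])(.*?)\2""", re.IGNORECASE)


def rewrite_links(html, page_uri, timestamp, prefix="/web"):
    """Point href/src attributes at archived copies instead of the live Web."""

    def replace(match):
        attr, quote, value = match.groups()
        # Leave fragments and non-HTTP schemes untouched.
        if value.startswith(("#", "javascript:", "mailto:", "data:")):
            return match.group(0)
        absolute = urljoin(page_uri, value)
        return f"{attr}={quote}{prefix}/{timestamp}/{absolute}{quote}"

    return ATTR.sub(replace, html)


page = '<a href="/about">About</a> <img src="logo.png"> <a href="#top">Top</a>'
replayed = rewrite_links(
    page, "http://www.vancouver2010.com/en/index.html", "20100212000000"
)
```

This handles only static, quoted attributes. URIs assembled by JavaScript at run time are not visible to this kind of rewriting, which is part of why the slide lists JavaScript as a replay concern.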

Page 45: Web archiving challenges and opportunities

Accessing Web Archive
• Page Completeness Degree
  • The completeness degree could be calculated in real time from the preserved HTTP status of the embedded resources

See also: http://arxiv.org/abs/1309.5503
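
A minimal sketch of the idea, assuming the simplest possible definition: the completeness degree is the fraction of a memento's embedded resources whose preserved HTTP status is 200. The unweighted formula and the function name are assumptions for illustration, not necessarily the measure used in the referenced work.

```python
def completeness(embedded_statuses):
    """Fraction of embedded resources whose archived HTTP status is 200.

    embedded_statuses maps each embedded resource URI to its preserved status.
    """
    if not embedded_statuses:
        return 1.0  # a page with no embedded resources has nothing missing
    ok = sum(1 for status in embedded_statuses.values() if status == 200)
    return ok / len(embedded_statuses)


statuses = {
    "http://example.com/style.css": 200,
    "http://example.com/logo.png": 404,
    "http://example.com/app.js": 200,
    "http://example.com/banner.gif": 503,
}
score = completeness(statuses)
```

A more careful version might follow archived redirects before judging a resource and weight resources by their importance to the rendered page.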

Page 46: Web archiving challenges and opportunities

Accessing Web Archive
• Reconstructing a web site
  • The current approach uses the web archive's public interface.

Page 47: Web archiving challenges and opportunities

Accessing Web Archive
• Wayback Annotator
  • Create collections
  • Select and save relevant content to their collections
  • Annotate & mark important parts of archived web pages
  • Share their work and collaborate on archived content use

http://netpreserve.org/sites/default/files/resources/Predstavitev_07.pdf
http://netpreserve.org/sites/default/files/resources/Wayback_annotator_06.pdf

Page 48: Web archiving challenges and opportunities

Accessing Web Archive

Collection-Based

• In addition to browsing the collection, you can browse the URIs in this collection

• Research questions: Collection overview

Page 49: Web archiving challenges and opportunities

Accessing Web Archive
• Collection visualization
  • Term frequency algorithms should be normalized to take memento density into consideration

http://ws-dl.blogspot.com/2012/08/2012-08-10-ms-thesis-visualizing.html
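
The normalization point can be illustrated with a small sketch: raw term counts grow with the number of mementos captured in a period, so each count is divided by that period's memento count. Bucketing by year and splitting on whitespace are simplifying assumptions, and the function name is made up for this example.

```python
from collections import Counter, defaultdict


def normalized_term_frequency(mementos):
    """mementos: iterable of (year, text).

    Returns {year: {term: average occurrences per memento in that year}}.
    """
    counts = defaultdict(Counter)
    density = Counter()
    for year, text in mementos:
        counts[year].update(text.lower().split())
        density[year] += 1
    return {
        year: {term: n / density[year] for term, n in terms.items()}
        for year, terms in counts.items()
    }


sample = [
    (2009, "olympic tickets"),
    (2010, "olympic games"),
    (2010, "olympic medals"),
    (2010, "olympic results"),
]
frequencies = normalized_term_frequency(sample)
```

In the sample, "olympic" appears three times as often in 2010 in raw counts only because 2010 has three mementos; after normalization both years score 1.0.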

Page 50: Web archiving challenges and opportunities

Accessing Web Archive
• Web Archive analytics

See also: http://ilpubs.stanford.edu:8090/1037/1/arcspread.pdf

• ArcSpread took a query from the user, extracted related information, and displayed the results in spreadsheet style.

Page 51: Web archiving challenges and opportunities

Who And What Links To The Internet Archive

Y. Alnoamany, A. AlSum, M. C. Weigle, M. L. Nelson

In Proceedings of 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013

(Best Student Paper)

See also: http://arxiv.org/abs/1309.4016

Page 52: Web archiving challenges and opportunities

Serving Robots!
• Log file analysis using Apache Pig
• Access to IA Wayback Machine as

Robots outnumber humans
• 10:1 in terms of sessions
• 5:4 in terms of raw HTTP accesses
• 4:1 in terms of megabytes transferred

Access

[Figure: robot-to-human ratios: sessions 10:1, HTTP accesses 5:4, MB transferred 4:1]

Page 53: Web archiving challenges and opportunities

Where do Wayback Machine Users Come From?

Website                      Percentage  Description
en.wikipedia.org             12.9%       Wikipedia
archive.org                  11.9%       IA Home Page
reddit.com                   10.2%       Social News Web Site
google.TLD                   9.9%        Search Engine
info-poland.buffalo.edu      1.5%        Polish Studies
de.wikipedia.org             1.4%        Wikipedia
cracked.com                  1.2%        Humor Site
snopes.com                   1.1%        Urban Legends Reference Pages
facebook.com                 0.9%        Social Media
crochetpatterncentral.com    0.9%        Crocheting Hobbies

Access

Page 54: Web archiving challenges and opportunities

Most Languages Self-Link

Access

Page 55: Web archiving challenges and opportunities

ArcLink: Optimization Techniques To Build And Retrieve The Temporal Web Graph

A. AlSum, M. L. Nelson

IIPC GA 2013, Ljubljana, Slovenia

In Proceedings of the 13th international ACM/IEEE joint conference on Digital libraries, JCDL '13, 2013

See also: http://arxiv.org/abs/1305.5959

Page 56: Web archiving challenges and opportunities

Easy Solved Questions

Q: What are the available mementos for vancouver2010.com?
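
This question is easy because an archive that supports Memento can return a TimeMap, a list of its mementos for a URI in link format. The sketch below parses such a list from a string. The function name is hypothetical, and fetching the TimeMap over HTTP is left out.

```python
import re
from email.utils import parsedate_to_datetime

# One link-format entry: <uri>; key="value"; key="value"
LINK = re.compile(r"<([^>]+)>\s*;([^<]*)")
PARAM = re.compile(r'(\w+)="([^"]*)"')


def parse_timemap(text):
    """Return (datetime, uri) pairs for every memento link, oldest first."""
    mementos = []
    for uri, raw_params in LINK.findall(text):
        params = dict(PARAM.findall(raw_params))
        # rel may hold several values, e.g. "first memento".
        if "memento" in params.get("rel", "").split() and "datetime" in params:
            mementos.append((parsedate_to_datetime(params["datetime"]), uri))
    return sorted(mementos)
```

Entries without a memento relation, such as the link to the original resource, are skipped.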

Access

Page 57: Web archiving challenges and opportunities

Solved Questions, but hard

Q: What are the HTML titles for vancouver2010.com through time?

A: Page scraping for all mementos
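
The page-scraping answer can be sketched as below: pull the title out of each memento's HTML and keep only the points in time where the title changed. It assumes the memento bodies have already been downloaded, which is the expensive part, and the function names are illustrative.

```python
import re
from html import unescape

TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_of(html):
    """Return the whitespace-normalized <title> text, or None if missing."""
    match = TITLE.search(html)
    return " ".join(unescape(match.group(1)).split()) if match else None


def title_history(mementos):
    """mementos: iterable of (timestamp, html) in any order.

    Returns [(timestamp, title)] with one entry per title change.
    """
    history = []
    for timestamp, html in sorted(mementos):
        title = title_of(html)
        if not history or history[-1][1] != title:
            history.append((timestamp, title))
    return history
```

The cost grows with the number of mementos, since every capture has to be fetched and parsed, which is why the slide calls this solved but hard.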

Access

Page 58: Web archiving challenges and opportunities

Impossible Questions

Q: What anchor texts have pointed to www.vancouver2010.com through time?

Access

…<a href=www.vancouver2010.com >Vancouver Olympics</a>….

…<a href=www.vancouver2010.com >Winter Olympics</a>…

…<a href=www.vancouver2010.com >Vancouver 2010</a>…
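
To make the question concrete, the sketch below extracts the anchor text of links to a target from a handful of archived pages. It covers only the per-page extraction step. The hard part, finding every archived page that links to the target, needs a link graph over the archive and is not shown. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect the text of <a> elements whose href mentions the target."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.texts = []
        self._buffer = None  # None while outside a matching <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and self.target in href:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.texts.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def anchor_texts(pages, target):
    """pages: iterable of (timestamp, html). Returns [(timestamp, anchor text)]."""
    found = []
    for timestamp, html in sorted(pages):
        collector = AnchorTextCollector(target)
        collector.feed(html)
        found.extend((timestamp, text) for text in collector.texts)
    return found
```

Matching the target as a substring of href is a simplification; a real system would canonicalize URIs before comparing them.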

Page 59: Web archiving challenges and opportunities

ArcLink

Access

Google code: https://code.google.com/p/arcsys/

Page 60: Web archiving challenges and opportunities

Impossible Questions
• Q: What anchor texts have pointed to www.vancouver2010.com through time?

Access

Page 61: Web archiving challenges and opportunities

Thumbnail Summarization Techniques For Web Archives

A. AlSum, and M. L. Nelson

Submitted for publication.

Page 62: Web archiving challenges and opportunities

Thumbnails

Access

Internet Archive UK Web archive

Page 63: Web archiving challenges and opportunities

Thumbnail Creation Challenges
• Scalability in Time
  • IA may need 361 years to create one thumbnail per memento using one hundred machines
• Scalability in Space
  • IA will need 355 TB to store one thumbnail per memento
• Page quality

Access

Page 64: Web archiving challenges and opportunities

How many thumbnails do we need?

Access

www.unfi.com on the live Web

Page 65: Web archiving challenges and opportunities

How many thumbnails do we need?

Access

www.unfi.com on the live Web

Page 66: Web archiving challenges and opportunities

40 Thumbnails are good.

Access

Page 67: Web archiving challenges and opportunities

Same technique applied to apple.com

Access

Page 68: Web archiving challenges and opportunities

From 8000 Mementos to 69 Thumbnails.

Access

Page 69: Web archiving challenges and opportunities

iTunes cover application

Access

Page 70: Web archiving challenges and opportunities

Community
• I suggest becoming a member of IIPC
  • Join the open Wayback Machine team
  • Join the Winter Olympics 2014 collaborative project, even as an observer

Congratulations

Page 71: Web archiving challenges and opportunities

Community
• Web Archiving Workshops

WAC 2011, Ottawa, Canada

WAC 2012, Stanford, CA, USA

WADL 2013, Indianapolis, IN, USA

TempWeb 2013, Rio de Janeiro, Brazil

Page 72: Web archiving challenges and opportunities

Tools for the SUL Web Archive
• Selection
• Harvest
• Analysis
• Access

Page 73: Web archiving challenges and opportunities

Conclusions
• Be Selective: Cover missing parts of the Web
• Be Older: Include WebBase
• Be Smart: Innovative services
• Be Helpful: Researcher Framework/Dataset
• Be Active: Participate in the WA communities
• Make a difference

[email protected]
@aalsum

Page 74: Web archiving challenges and opportunities

BACKUP

Page 75: Web archiving challenges and opportunities

What is missing?


Page 76: Web archiving challenges and opportunities

Thumbnail Features

SimHash DOM tree

Embedded resources Datetime

Page 77: Web archiving challenges and opportunities

Clustering technique
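
The transcript does not preserve how the features on the previous slide are combined, so the following is only one plausible sketch: compute a 64-bit SimHash of each memento's text and keep a new thumbnail only when the fingerprint differs enough from the last kept one. The greedy threshold rule, the threshold value, and all function names are assumptions, and this is not a clustering algorithm in the strict sense.

```python
import hashlib

BITS = 64


def simhash(text):
    """64-bit SimHash over whitespace-separated tokens."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i, weight in enumerate(weights) if weight > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def select_thumbnails(mementos, threshold=8):
    """mementos: list of (timestamp, text) in time order.

    Keep a memento only when it differs enough from the last kept one.
    """
    selected = []
    last = None
    for timestamp, text in mementos:
        fingerprint = simhash(text)
        if last is None or hamming(fingerprint, last) > threshold:
            selected.append(timestamp)
            last = fingerprint
    return selected
```

Near-identical captures produce fingerprints that differ in few bits, so long runs of unchanged pages collapse into a single thumbnail.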

Page 78: Web archiving challenges and opportunities
Page 79: Web archiving challenges and opportunities
Page 80: Web archiving challenges and opportunities
Page 81: Web archiving challenges and opportunities

Web Archive

Web Archive

Page 82: Web archiving challenges and opportunities