big data: indexing ~50tb of uris

1 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]

Towards Seamless Navigation of the Web of the Past

Using DISC to Reindex ~50Tb of URIs

Robert Sanderson [email protected] @azaroth42

Herbert Van de Sompel

[email protected] @hvdsomp

http://www.mementoweb.org/


Memento Project Overview Participants The Plan … … and the Reality Technical Details Next Run

Summary


Memento Project

Provide web citizens seamless access

to the web of the past Some accolades: •  $1M funding from the Library of Congress •  Winner of 2010 International Digital Preservation Award •  Accepted by International Internet Preservation Consortium

as the way forwards for access to web archives •  Nominated for Best Paper at JCDL 2010 •  Best Poster at JCDL 2012 •  Tim Berners-Lee @ LDOW10: “We absolutely need this” •  Internet Draft 6 (final call) by May

“ ”


Old Copies have New Names Everyone knows the URL for CNN is: http://www.cnn.com/ Everyone knows the URL for CNN was: http://www.cnn.com/


Old Copies have New Names Copies of the past versions of to http://www.cnn.com/ exist … but you don’t go there to get them, they have new URLs … and there’s no way to automatically discover those new URLs

http://web.archive.org/web/20031013111028/http://www2.cnn.com/

http://web.archive.org/web/20031013111028/http://www2.cnn.com/


Discovery is Hard People want to find, not search … and especially not searching by hand!

http://web.archive.org/web/*/http://www2.cnn.com/


Navigating in the Past Links to the real resource bring you back out of the past … or trap you in a single incomplete archive.

http://web.archive.org/web/*/http://www2.cnn.com/

Dec 20 2001, 4:51:00 UTC current

Pentagon


Memento: Time Travel for the Web TimeGate: Resource that knows the locations of archived copies Memento: An archived copy of previous state of a web resource

Link header to TimeGate

302 redirect to Memento

How does the TimeGate know where the appropriate Memento is?


Project Overview

Goal: •  To aggregate the metadata of the distributed archives

of the IIPC, and •  To provide Memento based access to the holdings

of open archives •  To provide knowledge of the holdings of restricted

archives •  To provide knowledge to IIPC members of the

holdings of totally closed archives


Experiment Participants

•  Austrian National Library •  Bibliothèque Nationale de France • British Library • Institut National de l'Audiovisuel • Internet Archive • Koninklijke Bibliotheek • Library of Congress •  Netarchive.dk •  Swiss National Library •  University of North Texas

• Los Alamos National Laboratory • Old Dominion University


Data

kosovakosovo.com/photo.php?id=5785

20090608161553

http://www.kosovakosovo.com/photo.php?id=5785

text/html

200

M36MRHSBVPLKMUN6PFOIEV3AH5ADITAN

-

563096

AIT-1068-20090608161511.warc.gz

Canonical URL

Datestamp

Request URL

MIME type HTTP Status Checksum

Redirect?

Bytes

Storage File

Multiplied by 6Tb compressed, ~50Tb uncompressed


The Plan…

•  To provide fast access to distributed archives, LANL would merge the indexes of the holdings of multiple archives and provide Memento based access

•  Step 1: Library of Congress gathers CDX files Step 2: LANL indexes (a.k.a. “…”) Step 3: Profit

•  Data: 6T of gzipped CDX files (mostly from IA)

•  Shipped on hard drives •  Computing: 210 node DISC cluster at LANL

•  2x 2ghz processors, 2x 2T HDD, 8G RAM


… and the Reality

•  Hardware failure killed one of the drives en route •  Transferred remaining files via BagIt from LoC

•  DISC has restricted access: •  Had to transfer data over intranet •  2 weeks to sync (5Mb/sec) •  And then 2 weeks to get the processed results off

•  Compute cluster has faulty switch, unreliable nodes: •  Ran original processing 15 times without success

due to hardware failures


Processing Design

•  For each CDX file, •  For each URI + timestamps,

•  Map URI to an appropriate database slice •  Merge timestamps with those of previous CDXs

•  Possible because: •  No need to do truncated search •  No need to walk through URIs in order •  No need for time based access, only URI

•  Problem is “Embarrassingly Parallel”


Approach 1: Online Messaging

•  25 read nodes, 1 control node, 150 write nodes •  Messages (1000 URIs) sent via control node to write

•  Failed 15 times due to hardware issues


Approach 1: Implementation

•  Python

•  MPI L •  PyMPI LL •  No failure detection, no auto restart, no zombie killing,

no … LLL •  Several months of experiments to find issues, fine tune

parameters most likely to complete…


Approach 2: No Interaction

•  43 read/split nodes •  Phase 1: Read nodes split CDX files to 3000 slices



•  Phase 2a: Transfer CDX slices to Control node •  Phase 2b: Transfer CDX slices to Write nodes



•  50 write nodes (* 60 slices each = 3000 slices) •  Phase 3: Merge slices from nodes to BerkeleyDBs


Next Steps

•  Re-index, using non-interactive approach

•  New data: 14 Tb (~120Tb uncompressed?) •  Use FileMap for better automation?

•  Two eSata 8T LaCie cubes, 2 3T internal drives to install locally

•  More intelligent data partitioning •  Some sort of error detection and handling!


Memento

http://mementoweb.org/

Robert Sanderson [email protected]

@azaroth42

Herbert Van de Sompel [email protected]

@hvdsomp

big data: indexing ~50tb of uris

Technology

transfer cdx

provide knowledge

distributed

orgwebhttp

web

copies

url

memento