big data: indexing ~50tb of uris
DESCRIPTION
Presentation at the Big Data interest group at Los Alamos National Laboratory, regarding with with the IIPC to merge the indexes of multiple web archives.TRANSCRIPT
![Page 1: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/1.jpg)
1 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Towards Seamless Navigation of the Web of the Past
Using DISC to Reindex ~50Tb of URIs
Robert Sanderson [email protected] @azaroth42
Herbert Van de Sompel
[email protected] @hvdsomp
http://www.mementoweb.org/
![Page 2: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/2.jpg)
2 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Memento Project Overview Participants The Plan … … and the Reality Technical Details Next Run
Summary
![Page 3: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/3.jpg)
3 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Memento Project
Provide web citizens seamless access
to the web of the past Some accolades: • $1M funding from the Library of Congress • Winner of 2010 International Digital Preservation Award • Accepted by International Internet Preservation Consortium
as the way forwards for access to web archives • Nominated for Best Paper at JCDL 2010 • Best Poster at JCDL 2012 • Tim Berners-Lee @ LDOW10: “We absolutely need this” • Internet Draft 6 (final call) by May
“ ”
![Page 4: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/4.jpg)
4 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Old Copies have New Names Everyone knows the URL for CNN is: http://www.cnn.com/ Everyone knows the URL for CNN was: http://www.cnn.com/
![Page 5: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/5.jpg)
5 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Old Copies have New Names Copies of the past versions of to http://www.cnn.com/ exist … but you don’t go there to get them, they have new URLs … and there’s no way to automatically discover those new URLs
http://web.archive.org/web/20031013111028/http://www2.cnn.com/
http://web.archive.org/web/20031013111028/http://www2.cnn.com/
![Page 6: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/6.jpg)
6 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Discovery is Hard People want to find, not search … and especially not searching by hand!
http://web.archive.org/web/*/http://www2.cnn.com/
![Page 7: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/7.jpg)
7 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Navigating in the Past Links to the real resource bring you back out of the past … or trap you in a single incomplete archive.
http://web.archive.org/web/*/http://www2.cnn.com/
Dec 20 2001, 4:51:00 UTC current
Pentagon
![Page 8: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/8.jpg)
8 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Memento: Time Travel for the Web TimeGate: Resource that knows the locations of archived copies Memento: An archived copy of previous state of a web resource
Link header to TimeGate
302 redirect to Memento
How does the TimeGate know where the appropriate Memento is?
![Page 9: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/9.jpg)
9 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Project Overview
Goal: • To aggregate the metadata of the distributed archives
of the IIPC, and • To provide Memento based access to the holdings
of open archives • To provide knowledge of the holdings of restricted
archives • To provide knowledge to IIPC members of the
holdings of totally closed archives
![Page 10: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/10.jpg)
10 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Experiment Participants
• Austrian National Library • Bibliothèque Nationale de France • British Library • Institut National de l'Audiovisuel • Internet Archive • Koninklijke Bibliotheek • Library of Congress • Netarchive.dk • Swiss National Library • University of North Texas
• Los Alamos National Laboratory • Old Dominion University
![Page 11: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/11.jpg)
11 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Data
kosovakosovo.com/photo.php?id=5785
20090608161553
http://www.kosovakosovo.com/photo.php?id=5785
text/html
200
M36MRHSBVPLKMUN6PFOIEV3AH5ADITAN
-
563096
AIT-1068-20090608161511.warc.gz
Canonical URL
Datestamp
Request URL
MIME type HTTP Status Checksum
Redirect?
Bytes
Storage File
Multiplied by 6Tb compressed, ~50Tb uncompressed
![Page 12: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/12.jpg)
12 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
The Plan…
• To provide fast access to distributed archives, LANL would merge the indexes of the holdings of multiple archives and provide Memento based access
• Step 1: Library of Congress gathers CDX files Step 2: LANL indexes (a.k.a. “…”) Step 3: Profit
• Data: 6T of gzipped CDX files (mostly from IA)
• Shipped on hard drives • Computing: 210 node DISC cluster at LANL
• 2x 2ghz processors, 2x 2T HDD, 8G RAM
![Page 13: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/13.jpg)
13 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
… and the Reality
• Hardware failure killed one of the drives en route • Transferred remaining files via BagIt from LoC
• DISC has restricted access: • Had to transfer data over intranet • 2 weeks to sync (5Mb/sec) • And then 2 weeks to get the processed results off
• Compute cluster has faulty switch, unreliable nodes: • Ran original processing 15 times without success
due to hardware failures
![Page 14: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/14.jpg)
14 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Processing Design
• For each CDX file, • For each URI + timestamps,
• Map URI to an appropriate database slice • Merge timestamps with those of previous CDXs
• Possible because: • No need to do truncated search • No need to walk through URIs in order • No need for time based access, only URI
• Problem is “Embarrassingly Parallel”
![Page 15: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/15.jpg)
15 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Approach 1: Online Messaging
• 25 read nodes, 1 control node, 150 write nodes • Messages (1000 URIs) sent via control node to write
• Failed 15 times due to hardware issues
![Page 16: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/16.jpg)
16 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Approach 1: Implementation
• Python
• MPI L • PyMPI LL • No failure detection, no auto restart, no zombie killing,
no … LLL • Several months of experiments to find issues, fine tune
parameters most likely to complete…
![Page 17: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/17.jpg)
17 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Approach 2: No Interaction
• 43 read/split nodes • Phase 1: Read nodes split CDX files to 3000 slices
![Page 18: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/18.jpg)
18 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Approach 2: No Interaction
• Phase 2a: Transfer CDX slices to Control node • Phase 2b: Transfer CDX slices to Write nodes
![Page 19: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/19.jpg)
19 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Approach 2: No Interaction
• 50 write nodes (* 60 slices each = 3000 slices) • Phase 3: Merge slices from nodes to BerkeleyDBs
![Page 20: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/20.jpg)
20 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Next Steps
• Re-index, using non-interactive approach
• New data: 14 Tb (~120Tb uncompressed?) • Use FileMap for better automation?
• Two eSata 8T LaCie cubes, 2 3T internal drives to install locally
• More intelligent data partitioning • Some sort of error detection and handling!
![Page 21: Big Data: Indexing ~50Tb of URIs](https://reader034.vdocument.in/reader034/viewer/2022051608/53ff9a538d7f724c088b46cc/html5/thumbnails/21.jpg)
21 Big Data Interest Group, LANL, Feb 21st 2013 [Unclassified]
Memento
http://mementoweb.org/
Robert Sanderson [email protected]
@azaroth42
Herbert Van de Sompel [email protected]
@hvdsomp