web archiving and access mike smorul joseph jaja adapt group university of maryland, college park

Web Archiving and AccessWeb Archiving and Access

Mike SmorulJoseph JaJa

ADAPT GroupUniversity of Maryland, College

Park

Web Archive Storage and Web Archive Storage and SearchSearch

• Management tools

• Storage infrastructure

• Indexing, searching, compression experiments

7/21/2010 NDIIPP Partners Meeting 2

Webarc ManagerWebarc Manager

• Develop a tool to help manage webarc collections

• Show statistics of a series of crawls

• Open API to easily query collection– List all copies of a page, etc


Manager ComponentsManager Components

• WarcManager (server)– REST-based access to index– Index of DAT/ARC entries– URL Searching, ARC browsing,

• Javascript Client• Simple Web-Accessible Preservation

(SWAP)– Web-accessible distributed storage– ARC page retrieval– 1Gbps, 2200requests/s


Manager DesignManager Design


Javascript Client

Webarc Manager

MySQL

Storage Server

Storage Server

RESTJSON

Page Request

Manager ScreenshotsManager Screenshots


Search ResultsSearch ResultsURL DetailsURL DetailsWARC File DetailsWARC File Details

Storage DesignStorage Design

• SWAP – Simple, Web-Accessible Preservation• Intelligent placement of files across multiple

servers and disk partitions• Simple HTTP access, PUT, GET, DELETE• Use redirects to provide a uniform namespace• Files organized into file groups

– Each group resides on multiple partitions (slices)– Hash(file_path) % slices = partition

• No centralized catalog


How it worksHow it works


Client

Server 1 Server 2

P1 P2 P4P3

GET ‘bag1/warc55.gz’

Calculate hash(bag1/warc55.gz) % 4 = 3

Return 302: Server 2 GET ‘bag1/warc55.gz’Return warc55.gz

PerformancePerformance

• Good small file and large file performance– Over 2000 requests/s and 3000 redirects/


Time Machine for the Web

• Fast parallel indexer to handle large scale crawled web contents, coupled with a new compression scheme.

• Fast search of contents based on unstructured queries involving temporal specifications.

• Presentation of pertinent summary information in ranked order according to the temporal context.


Parallel Index ConstructionParallel Index Construction

• A parallel and hybrid strategy of multi-core CPU and multi-GPU

• Processing Speed with a single node: 80MB/s


DiskDisk

ARC File

CPU ParserCPU Parser






DiskDisk

Dictionary &Postings Lists

Parallel CPU Parser

CPU IndexerCPU Indexer

CPU IndexerCPU Indexer

GPU IndexerGPU Indexer

GPU IndexerGPU Indexer

Parallel CPU/GPU Indexer

Index CompressionIndex Compression

• A combination of – integer compression scheme and– multi-versioned document posting w/

temporal information included• Potential space savings of temporal

information for frequent terms compared to full temporal information: 50%.

• Potential space saving from integer compression: 4B vs 1B for small integers


Temporally Anchored Temporally Anchored Information RetrievalInformation Retrieval

• Given a temporally anchored query, return a ranked set of pages within a temporal context.

• A new approach that:– Substantially limits the search space– Generates efficiently temporally

ranked pages

• Extensive experimental results.7/21/2010 NDIIPP Partners Meeting 13

Additional InformationAdditional Information

• http://adapt.umiacs.umd.edu– Papers, results, etc..

• E-mail: [email protected]


http://adapt.umiacs.umd.edu/

web archiving and access mike smorul joseph jaja adapt group university of maryland, college park

Documents

ndiipp partners meetinghow

ndiipp partners meetinga

temporal context

temporal specifications

new compression scheme

web archiving

large file performanceover

file groupseach group