Archive-It Architecture Introduction, April 18, 2006, Dan Avery, Internet Archive

Archive-It Architecture Introduction

April 18, 2006

Dan Avery

Internet Archive

Archive-It Components

•Crawling

•User Interface

•Storage

•Playback

•Text Indexing

•Integration

Component Integration

Crawling

•Heritrix ( http://crawler.archive.org/ )

•Java application

•Open source (LGPL)

•Crawls for completeness/depth

•Highly configurable

Crawling - Distributed Crawling

•Heritrix Cluster Controller

•Java component - open source - developed by IA

•http://crawler.archive.org/hcc

•Provides proxy access to pool of Heritrix instances through JMX interface

•Provides crawler control and status

•Currently controlling 33 crawler instances on three commodity dual-Opteron machines; upper bound unknown
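
The proxy idea above can be sketched in a few lines. This is only an illustration of a single controller fronting a pool of crawlers; it does not use JMX or the real HCC API, and the names `CrawlerInstance`, `CrawlerPool`, and the least-loaded dispatch policy are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlerInstance:
    """Stand-in for one Heritrix instance (hypothetical; no JMX involved)."""
    name: str
    jobs: list = field(default_factory=list)

    def status(self) -> dict:
        return {"name": self.name, "active_jobs": len(self.jobs)}


class CrawlerPool:
    """Single entry point that hides which instance runs a given job."""

    def __init__(self, instances):
        self.instances = list(instances)

    def submit(self, job_name: str) -> str:
        # Assumed policy: send the job to the instance with the fewest jobs.
        target = min(self.instances, key=lambda inst: len(inst.jobs))
        target.jobs.append(job_name)
        return target.name

    def status(self) -> list:
        # Aggregated status for every instance in the pool.
        return [inst.status() for inst in self.instances]


pool = CrawlerPool(CrawlerInstance(f"crawler-{i}") for i in range(3))
assignments = [pool.submit(f"job-{n}") for n in range(4)]
```

Callers only ever talk to `pool`, which is the property the slide attributes to HCC: control and status for many crawlers through one interface.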

Archive-It Web Application

• User Interface and Crawl Scheduling

• Gets seed URLs and crawl parameters from users

• Schedules new periodic crawls

• Talks to crawler pool through HCC

• Provides access, search, and crawl history UI
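
As a rough illustration of periodic crawl scheduling, the sketch below decides which collections are due for a new crawl. The frequency names, the collection records, and the function names are all hypothetical; the slides do not describe how the web application stores or evaluates schedules.

```python
from datetime import datetime, timedelta

# Hypothetical frequency options; the real set is not given in the slides.
FREQUENCIES = {"daily": timedelta(days=1), "weekly": timedelta(weeks=1)}


def next_crawl(last_crawl: datetime, frequency: str) -> datetime:
    """Time at which the next periodic crawl should start."""
    return last_crawl + FREQUENCIES[frequency]


def due_crawls(collections: list, now: datetime) -> list:
    """Names of collections whose next crawl time has already passed."""
    return [
        c["name"]
        for c in collections
        if next_crawl(c["last_crawl"], c["frequency"]) <= now
    ]


# Made-up example data.
collections = [
    {"name": "state-agencies", "frequency": "daily",
     "last_crawl": datetime(2006, 4, 17, 9, 0)},
    {"name": "university-sites", "frequency": "weekly",
     "last_crawl": datetime(2006, 4, 14, 9, 0)},
]
due = due_crawls(collections, now=datetime(2006, 4, 18, 12, 0))
```

In a setup like the one described, each due collection's seed URLs and crawl parameters would then be handed to the crawler pool.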

Storage

•archive.org ARC repository

•custom Perl system

•simple storage on primary/backup pairs

•monthly MD5 digest verification

•robust, non-proprietary file format

•Alexandria (Egypt)/Amsterdam
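
The digest check on primary/backup pairs could look roughly like the sketch below. The production system is described as custom Perl; this Python version only illustrates the idea, and the function names and the notion of a separately recorded digest are assumptions.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 so large ARC files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pair(primary_path: str, backup_path: str, recorded_md5: str) -> dict:
    """Compare both copies against the digest recorded when the file was stored."""
    return {
        "primary_ok": md5_of_file(primary_path) == recorded_md5,
        "backup_ok": md5_of_file(backup_path) == recorded_md5,
    }
```

If exactly one copy fails the check, the other copy of the pair can be used to repair it, which is the point of keeping a recorded digest rather than only comparing the two copies with each other.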

Access

• Internet Archive Wayback Machine

• Replaying archived web pages since 2001

• Current IA version written in Perl and C, with components distributed across various machines

• Not open source, but open source beta (in Java) available now

Full-Text Indexing

•Nutch (http://nutch.org)

•NutchWAX (http://archive-access.sf.net) additions create and search indexes of stored ARC files

•Standard text search plus link analysis

•Can search by date instead of relevance, which is useful for individual archives
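
To make the date-versus-relevance distinction concrete, the sketch below orders the same set of hits both ways. It does not use the Nutch or NutchWAX API; the hit records, field names, and the newest-first ordering are assumptions for illustration. The 14-digit capture timestamps sort correctly as plain strings.

```python
from operator import itemgetter

# Made-up search hits with a relevance score and a capture timestamp
# in YYYYMMDDhhmmss form.
hits = [
    {"url": "http://example.org/a", "score": 0.91, "captured": "20050110083000"},
    {"url": "http://example.org/b", "score": 0.42, "captured": "20060301120000"},
    {"url": "http://example.org/c", "score": 0.77, "captured": "20051120000000"},
]


def order_hits(hits: list, by: str = "relevance") -> list:
    """Return hits sorted by score, or by capture date (newest first)."""
    key = "captured" if by == "date" else "score"
    return sorted(hits, key=itemgetter(key), reverse=True)


by_relevance = [h["url"][-1] for h in order_hits(hits)]
by_date = [h["url"][-1] for h in order_hits(hits, by="date")]
```

Within a single archive, where every hit comes from the same small set of sites, the date ordering lets a user follow how a page changed over time.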

Text Indexing Challenges

•Some parts are distributable, some are not

•Incremental indexing - goal of new crawls in index within 72 hours

•Working on a map/reduce version usable for Archive-It (target: July)

•In the meantime, a lot of workarounds

Integration

•Group of Perl and bash scripts - planning more complex than the execution

•Most components available individually

•Decentralized control, centralized monitoring

•Each component operates almost entirely independently

The Big Picture

Future Challenges

•Crawler trap detection

•Scalability

•Current setup can accommodate 300 partners at current crawling rates

•During pilot we crawled/indexed/stored just over 100,000,000 documents (~4TB) in eight weeks

•More machines can be easily added to storage and crawling clusters
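
The pilot figures above imply some useful averages. The arithmetic below assumes decimal units (1 TB = 10^12 bytes) and a load spread evenly over the eight weeks; both are simplifications, since the slide only gives approximate totals.

```python
DOCUMENTS = 100_000_000      # "just over 100,000,000 documents"
TOTAL_BYTES = 4 * 10**12     # "~4TB", decimal terabytes assumed
DAYS = 8 * 7                 # eight weeks
SECONDS = DAYS * 24 * 60 * 60

avg_doc_kb = TOTAL_BYTES / DOCUMENTS / 1000      # average document size in KB
docs_per_second = DOCUMENTS / SECONDS            # sustained documents per second
gb_per_day = TOTAL_BYTES / DAYS / 10**9          # storage growth per day in GB

print(f"average document size: {avg_doc_kb:.0f} KB")
print(f"sustained rate: {docs_per_second:.1f} documents/s")
print(f"storage growth: {gb_per_day:.1f} GB/day")
```

Under these assumptions the pilot averaged about 40 KB per document, roughly 21 documents per second, and about 71 GB of new storage per day.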

Scalability

•Current Nutch is between versions

•Old version has some non-distributable pieces

•New version is much more distributable and scalable (map/reduce - Hadoop), but not ready for incremental indexing
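
For readers unfamiliar with map/reduce, the toy example below builds an inverted index in that style. It runs in a single process and does not use Hadoop or Nutch; the function names and sample documents are made up. The point is only that the map and reduce steps are independent per document and per term, which is what makes the approach distributable.

```python
from collections import defaultdict


def map_phase(doc_id: str, text: str):
    """Emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term: str, doc_ids: list):
    """Collapse all postings for one term into a sorted list."""
    return term, sorted(doc_ids)


# Made-up documents standing in for text extracted from ARC files.
docs = {
    "d1": "web archive crawl",
    "d2": "archive storage",
    "d3": "crawl schedule",
}
pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
index = dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())
```

Because this sketch rebuilds the whole index from all documents, it also hints at why incremental indexing needs extra work: adding one new crawl means merging new postings into an existing index rather than rerunning everything.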

Looking ahead

•After basic UI/archiving/indexing...

•Time-based search UI

•Analyzing archives for research and ongoing collection improvement

•Content classification

•Rate of change

•New site suggestions

http://www.archive-it.org