Archiving the web: does whole-of-domain archiving = information overload?

Dr Bob Pymm, Jake Wallis, Charles Sturt University

Background

• PANDORA - selective web archiving by the National Library of Australia (NLA) since 1996, c. 20,000 titles

• NLA whole-of-domain (.au) web archiving annually since 2005; over 500 million files in 2007 – 19 TB of data

• Undertaken by the US-based Internet Archive

• Simple keyword index created, plus URL index

Issues related to the crawl

• No authorisation for gathering the files – restrictions on use

• Complexities arising from the diverse nature of the Web
– Australian content on overseas servers, sites not in the “.au” domain (e.g. blogs)
– Difficulty in capturing dynamically created content
– The size of the resulting dataset and indexes

Indexing and large datasets

• Remember – whenever searching the Web or a harvest, it is the index being searched, not the actual contents

• Thus the importance of effective indexing, with research under way on how to improve it

• Google’s success with ranking and weighting in its indexes

• Alternative methods include visual results sets which show links as well as straight results
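The point that a search consults the index rather than the content can be illustrated with a minimal sketch of a keyword (inverted) index. This is only an illustration of the general idea, not the indexing used for the crawl or for PANDORA; the function names and the two sample pages are assumptions invented here.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase keyword to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return URLs containing every query word, consulting only the index."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages, invented for illustration only.
pages = {
    "http://example.org/a": "Indigenous health programs",
    "http://example.org/b": "Landcare programs in rural areas",
}
index = build_index(pages)
```

Once the index is built, `search` never reads the page text again, so anything the index does not record (ranking signals, context, quality) cannot influence the results.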

Research approach

• The focus was the 2007 “.au” Web crawl and its accessibility

• Two topics from the 20/20 Summit chosen:
– Indigenous health
– Landcare programs

• High profile topics, likely to be of long term interest

Searching

• Searched across PANDORA and the Crawl dataset

• PANDORA – curated titles; multiple indexes

• Crawl – keyword and URL index only

• Simple search on each term – top five records from PANDORA seen as important and authoritative due to their careful selection and indexing. Thus seen as key resources on the topic. All ten were Federal Government sites.

Searching (cont.)

• Same terms searched in the Web crawl (note: all PANDORA sites were eventually found in the Crawl)

Topic | PANDORA (pages returned) | Web Crawl (pages returned) | Top 5 PANDORA sites in first three pages of Crawl?
Indigenous health | 64 | 768,402 | None
Landcare program | 29 | 83,843 | 1
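The final column of the comparison, how many of the top five PANDORA sites appear in the first three pages of Crawl results, could be computed along the following lines. This is a hedged sketch rather than the procedure the authors used: the function name, the page size of 10 results, and the sample URLs are all assumptions made for illustration.

```python
def curated_hits_in_top_pages(curated_urls, crawl_results, pages=3, per_page=10):
    """Count curated URLs that appear within the first `pages` pages of results.

    `per_page` (results shown per page) is an assumed value, not from the study.
    """
    window = set(crawl_results[: pages * per_page])
    return sum(1 for url in curated_urls if url in window)


# Hypothetical data, invented for illustration only.
curated = ["http://example.gov.au/one", "http://example.gov.au/two"]
crawl = ["http://example.com.au/%d" % i for i in range(40)]
crawl[12] = "http://example.gov.au/one"   # falls inside the first 30 results
crawl[35] = "http://example.gov.au/two"   # falls outside the first 30 results
```

With the sample data, only the first curated URL falls inside the three-page window, so the function counts one hit.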

The Long Tail

• Like the Pareto Principle, the Long Tail paradigm suggests a small proportion of the available information meets the vast majority of needs.

Discussion

• Searchers stop after a small number of pages – the top of the tail – leaving a very long tail of hits not considered

• Selective archives such as PANDORA deliver small numbers of pages of high relevance (curated), i.e. the top of the tail

• The Web Crawl gives the top and the long tail all together; the indexing decides which results surface first

Discussion (cont.)

• Cost/difficulty of creating effective indexes and display mechanisms for huge datasets

• Issue of rights and privacy infringements when no permissions sought for data harvesting

• BUT curated collections such as PANDORA may reflect a conservative paradigm (for instance, the top five sites for each topic were all Federal Government)

Discussion (cont.)

• The Web Crawl captures a broader cultural milieu

• Online activities – political activism, public communication, social networking, audio and video sharing

• Collection development paradigm changing – mediated vs democratic – PANDORA vs Web Crawl

Conclusion

• Mediated or curated selection, as in PANDORA, delivers data of quality and integrity, readily accessible, but in very limited areas.

• The Web Crawl delivers a huge mass of data, collected without fear or favour, but very hard to access

• What will best meet needs of future researchers – has to be a continuing debate
