Australian web domain harvests 2005, 2006 & 2007
[Chart: Unique Hosts Collected]
AusCrawl 2005: 811,523
AusCrawl 2006: 1,260,533
AusCrawl 2007: 1,247,614
PANDORA all: 42,936
PANDORA (.au): 10,037
[Photo: Igor Ranitovic, Internet Archive engineer, with Petabox rack for Australian domain harvest]
PANDORA : Domain Harvesting
• Australian domain harvest
  – .au domain, located on Australian servers
  – Internet Archive
• 1st harvest June/July 2005
  – 4 weeks, 185m files, 6.69 TB
• 2nd harvest Aug/Sept 2006
  – 5 weeks, 596m files, 19.04 TB
• 3rd harvest Aug/Sept 2007
  – 4 weeks, 516m files, 18.47 TB
Comparative statistics
PANDORA
Files: 51 million
Size: 2.12 TB
Domain Harvest   2005          2006          2007
Unique files     185,549,662   596,238,990   516,064,820
Hosts crawled    811,523       1,046,038     1,247,614
Size             6.69 TB       19.04 TB      18.47 TB
Domain Harvests
Files: 1,297 million
Size: 44.2 TB
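The combined domain harvest totals follow from summing the three per-harvest figures. A minimal sketch of that arithmetic is below. The variable and function names are illustrative only and are not part of the original slides. The PANDORA file count is the rounded "51 million" from the slide, so the ratios are approximate.

```python
# Per-harvest figures copied from the comparative statistics slide.
harvests = {
    2005: {"files": 185_549_662, "size_tb": 6.69},
    2006: {"files": 596_238_990, "size_tb": 19.04},
    2007: {"files": 516_064_820, "size_tb": 18.47},
}

# PANDORA figures as quoted on the same slide (file count is rounded).
pandora = {"files": 51_000_000, "size_tb": 2.12}


def harvest_totals(data):
    """Sum unique files and size (TB) across all harvests."""
    total_files = sum(h["files"] for h in data.values())
    # Round to two decimals to avoid floating-point noise in the TB sum.
    total_size_tb = round(sum(h["size_tb"] for h in data.values()), 2)
    return total_files, total_size_tb


total_files, total_size_tb = harvest_totals(harvests)

print(f"Total files: {total_files:,}")        # Total files: 1,297,853,472
print(f"Total size: {total_size_tb} TB")      # Total size: 44.2 TB
print(f"Files vs PANDORA: {total_files / pandora['files']:.1f}x")
print(f"Size vs PANDORA: {total_size_tb / pandora['size_tb']:.1f}x")
```

The sums agree with the slide's "1,297 million" files and "44.2 TB". On these figures the three harvests together hold roughly 25 times the files and 21 times the bytes of the PANDORA selective archive.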
PANDORA : Domain Harvesting
[Chart: Size in Terabytes]
PANDORA: 1.73
AusCrawl 05: 6.69
AusCrawl 06: 19.04
AusCrawl 07: 18.47
PANDORA : Domain Harvesting
• Some pros
  – Retains linkages and context
  – Large scale – more bytes for the buck
  – Less selectively discriminating
• Some cons
  – High dependence on the crawler technology
  – Domain and geo-location bias (.au, geoIP)
  – Limitations in timeliness, quality assurance, scoping, site complexity, deep web
  – Legal and access issues to resolve
PANDORA : Australia’s Web Archive
• Enormous growth and volume of material
• Everyone can be creators and publishers
• Virtually instantaneous publication
• Dynamic content and format
• Multiplicity of formats
• Technology dependent
• Hyperlinked and interconnected
• Highly accessible but hard to identify
• Ephemeral
• Interactivity, re-use, personalisation (web 2.0)