Australian web domain harvests 2005, 2006 & 2007

[Chart: Unique Hosts Collected. AusCrawl 2005: 811,523; AusCrawl 2006: 1,260,533; AusCrawl 2007: 1,247,614; PANDORA all: 42,936; PANDORA (.au): 10,037]

[Photo: Igor Ranitovic, Internet Archive engineer, with Petabox rack for Australian domain harvest]

PANDORA : Domain Harvesting

• Australian domain harvest
  – .au domain, located on Australian servers
  – Internet Archive

• 1st harvest June/July 2005
  – 4 weeks, 185m files, 6.69 TB

• 2nd harvest Aug/Sept 2006
  – 5 weeks, 596m files, 19.04 TB

• 3rd harvest Aug/Sept 2007
  – 4 weeks, 516m files, 18.47 TB

Comparative statistics

PANDORA

Files: 51 million

Size: 2.12 TB

Domain Harvest    2005           2006           2007
Unique files      185,549,662    596,238,990    516,064,820
Hosts crawled     811,523        1,046,038      1,247,614
Size              6.69 TB        19.04 TB       18.47 TB

Domain Harvests

Files: 1,297 million

Size: 44.2 TB
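
As a quick check on the totals above, the following minimal Python sketch sums the per-harvest figures and compares them with the PANDORA figures. The variable names are illustrative and not from the original slides; the PANDORA values (51 million files, 2.12 TB) are the rounded figures quoted in this section, so the ratios are approximate.

```python
# Per-harvest figures copied from the comparative statistics table.
harvests = {
    2005: {"unique_files": 185_549_662, "size_tb": 6.69},
    2006: {"unique_files": 596_238_990, "size_tb": 19.04},
    2007: {"unique_files": 516_064_820, "size_tb": 18.47},
}

# PANDORA figures as quoted on the slide (rounded in the source).
pandora_files = 51_000_000
pandora_size_tb = 2.12

# Sum the three domain harvests.
total_files = sum(h["unique_files"] for h in harvests.values())
total_size_tb = round(sum(h["size_tb"] for h in harvests.values()), 2)

# Approximate scale of the domain harvests relative to PANDORA.
files_ratio = total_files / pandora_files
size_ratio = total_size_tb / pandora_size_tb

print(f"Total files: {total_files:,}")
print(f"Total size: {total_size_tb} TB")
print(f"Files vs PANDORA: about {files_ratio:.1f}x")
print(f"Size vs PANDORA: about {size_ratio:.1f}x")
```

The sums agree with the slide's "1,297 million" files and "44.2 TB".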


[Chart: Size in Terabytes. PANDORA: 1.73; AusCrawl 05: 6.69; AusCrawl 06: 19.04; AusCrawl 07: 18.47]


• Some pros
  – Retains linkages and context
  – Large scale – more bytes for the buck
  – Less selectively discriminating

• Some cons
  – High dependence on the crawler technology
  – Domain and geo-location bias (.au, geoIP)
  – Limitations in timeliness, quality assurance, scoping, site complexity, deep web
  – Legal and access issues to resolve

PANDORA : Australia’s Web Archive

• Enormous growth and volume of material
• Everyone can be creators and publishers
• Virtually instantaneous publication
• Dynamic content and format
• Multiplicity of formats
• Technology dependent
• Hyperlinked and interconnected
• Highly accessible but hard to identify
• Ephemeral
• Interactivity, re-use, personalisation (web 2.0)
