WIRE, an Open-Source Information Retrieval Environment (OSWIR 2005, Compiegne)


WIRE: an Open Source Web Information Retrieval Environment

Carlos Castillo and Ricardo Baeza-Yates

Center for Web Research, http://www.cwr.cl/

CS Dept., University of Chile

OSWIR 2005, Compiegne, France, September 19, 2005


1 WIRE Project

2 Web Crawler

3 Conclusions

Motivation

Study subsets of the Web (1-50 million pages)

✓ We want high performance

✓ We want to keep as much data as possible

✓ We want to study scheduling algorithms

✗ wget is not enough

✗ Large-scale crawlers were not publicly available


General Architecture

[Figure: general architecture. Components shown: Crawling, Collection, Importing, Extracting, Text Index, Text Search, XML Index, XML Search, Statistics, Clustering, Classification, Focused Crawling]

Characteristics

- Roughly 25,000 lines of open-source C/C++ code

- Asynchronous DNS and HTTP requests, small memory and processing requirements (except during the analysis)

✓ Highly configurable: rate of download, parser parameters, scheduling policy, etc.


Web Crawler

Gatherer: parsing, link extraction

Harvester: short-term scheduling, network transfers

Seeder: link resolving, robots exclusions

Manager: page score calculations, long-term scheduling

Collection

Scheduling

Page 1: quality 0.4, freshness 0.1, visited? 1. Current value: 0.04. Profit: 0.36

Page 2: quality 0.7, freshness 0.9, visited? 1. Profit: 0.07

Page 3: quality 0.6, freshness -, visited? 0. Profit: 0.6

Profit = FutureValue - CurrentValue
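
The profit figures for the three example pages can be reproduced with a short sketch. It assumes, as an inference from the numbers on the slide rather than from WIRE's code, that the current value of a visited page is quality times freshness, that an unvisited page has a current value of 0, and that a freshly downloaded page has freshness 1. The names `Page`, `current_value`, `future_value`, and `profit` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    quality: float
    freshness: float  # ignored when the page has never been visited
    visited: bool


def current_value(page: Page) -> float:
    # Assumption: a stored copy is worth quality * freshness; no copy is worth 0.
    return page.quality * page.freshness if page.visited else 0.0


def future_value(page: Page) -> float:
    # Assumption: right after a download the copy is fully fresh (freshness 1).
    return page.quality


def profit(page: Page) -> float:
    # Profit = FutureValue - CurrentValue
    return future_value(page) - current_value(page)


pages = [
    Page(quality=0.4, freshness=0.1, visited=True),
    Page(quality=0.7, freshness=0.9, visited=True),
    Page(quality=0.6, freshness=0.0, visited=False),
]

# A long-term scheduler could then download the most profitable pages first.
ranked = sorted(pages, key=profit, reverse=True)
```

Under these assumptions the unvisited page ranks first, followed by the stale page and then the nearly fresh one, which matches the profits shown on the slide.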

Downloading pages

[Figure: the World Wide Web shown as Web sites S1 to S7, each containing Web pages]

Storing contents

[Figure: storage pipeline. Labels: Document, hash( ), Content seen?, Free space list, Disk Storage]
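
One way to read the storage figure is as a duplicate check followed by space allocation. The sketch below is an illustration under assumptions, not WIRE's implementation: it uses SHA-1 as the hash function, an in-memory `bytearray` in place of the disk file, and a first-fit free space list. The class `ContentStore` and its methods are hypothetical names.

```python
import hashlib


class ContentStore:
    def __init__(self):
        self.seen = {}           # content digest -> (offset, length)
        self.free = []           # free space list: (offset, length) holes
        self.disk = bytearray()  # stand-in for the on-disk storage file

    def store(self, content: bytes):
        """Return (offset, is_duplicate) for the given document content."""
        digest = hashlib.sha1(content).hexdigest()
        if digest in self.seen:  # "Content seen?" check
            return self.seen[digest][0], True
        offset = self._allocate(len(content))
        self.disk[offset:offset + len(content)] = content
        self.seen[digest] = (offset, len(content))
        return offset, False

    def delete(self, content: bytes):
        """Forget a document and return its space to the free list."""
        digest = hashlib.sha1(content).hexdigest()
        self.free.append(self.seen.pop(digest))

    def _allocate(self, size: int) -> int:
        # First fit: reuse the first hole that is large enough.
        for i, (offset, length) in enumerate(self.free):
            if length >= size:
                if length == size:
                    del self.free[i]
                else:
                    self.free[i] = (offset + size, length - size)
                return offset
        # No suitable hole: grow the storage at the end.
        offset = len(self.disk)
        self.disk.extend(b"\0" * size)
        return offset
```

Storing the same bytes twice returns the original offset with the duplicate flag set, so a second copy never reaches the disk.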

URL parsing

URL: http://host.domain.com/dir/file.html

h1('host.domain.com') finds the site entry: host.domain.com 235

h2('235 dir/file.html') finds the document entry: 235 dir/file.html 9421

SITE-ID = 235; DOC-ID = 9421
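
The two-level lookup can be sketched as follows. This is a minimal illustration, assuming Python dictionaries in place of the hash tables behind h1 and h2 and sequentially assigned identifiers; the class `UrlIndex` and its `resolve` method are hypothetical, and the query string is ignored for brevity.

```python
from urllib.parse import urlsplit


class UrlIndex:
    def __init__(self):
        self.site_ids = {}  # host name -> SITE-ID (role of h1)
        self.doc_ids = {}   # "SITE-ID path" -> DOC-ID (role of h2)

    def resolve(self, url: str):
        """Return (site_id, doc_id), assigning new ids on first sight."""
        parts = urlsplit(url)
        host = parts.hostname          # lower-cased host name
        path = parts.path.lstrip("/")  # e.g. "dir/file.html"
        site_id = self.site_ids.setdefault(host, len(self.site_ids) + 1)
        # The document key embeds the SITE-ID, so equal paths on
        # different sites map to different documents.
        key = f"{site_id} {path}"
        doc_id = self.doc_ids.setdefault(key, len(self.doc_ids) + 1)
        return site_id, doc_id
```

Keying documents by SITE-ID plus path means the host name string is stored only once per site, however many pages the site has.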

Practical problems

- The devil is in the details

  - Varying quality of service

  - Wrong DNS records, temporary DNS failures

  - HTTP responses without headers, or with wrong headers or dates

  - HTML parsing has to be very tolerant

  - Duplicate pages, session-ids, etc.
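
As an illustration of the session-id problem, a crawler can normalize URLs before looking them up, so that the same page reached under different session identifiers is not stored twice. The sketch below is one possible approach, not WIRE's method; the parameter names in `SESSION_PARAMS` are common examples chosen for illustration, and `strip_session_ids` is a hypothetical helper.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed examples of session parameter names; a real list would be configurable.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}


def strip_session_ids(url: str) -> str:
    """Return the URL without session parameters and without its fragment."""
    parts = urlsplit(url)
    # Remove ";jsessionid=..." path parameters.
    path = re.sub(r";jsessionid=[^/?#]*", "", parts.path, flags=re.IGNORECASE)
    # Keep every query parameter that is not a known session identifier.
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(kept), ""))
```

Normalization of this kind only catches identifiers with recognizable names, which is one reason a content-based duplicate check remains useful alongside it.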


Data analysis

- Includes link analysis and extraction of statistics (data is exported as .csv files)

- Reports are generated using LaTeX and gnuplot

- Report about documents: histograms of size, in- and out-degree, link scores, page depth, HTTP responses, age, media types, etc.

- Report about sites: degree distribution in the hostgraph, maximum depth, pages per site, link structure, etc.
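
To make the statistics step concrete, the sketch below computes an in- or out-degree histogram from a list of links and writes it as CSV text, which a plotting tool such as gnuplot could then read. The function name `degree_histogram_csv`, the column names, and the input format (pairs of source and destination page identifiers) are assumptions for illustration, not WIRE's actual export format.

```python
import csv
import io
from collections import Counter


def degree_histogram_csv(links, kind="in"):
    """Return CSV text with one row per degree: degree, number of pages."""
    degree = Counter()
    pages = set()
    for src, dst in links:
        pages.update((src, dst))
        # In-degree counts links arriving at dst, out-degree links leaving src.
        degree[dst if kind == "in" else src] += 1
    # Pages that never appear on the counted side have degree 0.
    histogram = Counter(degree[page] for page in pages)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["degree", "pages"])
    for value in sorted(histogram):
        writer.writerow([value, histogram[value]])
    return out.getvalue()
```

For example, `degree_histogram_csv([(1, 2), (3, 2), (4, 2)])` yields three pages with in-degree 0 and one page with in-degree 3.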


Conclusions

✓ A tool for Web characterization studies

✓ Can be extended for other purposes

✓ Code and documentation available at http://www.cwr.cl/projects/

Thank you.
