mobile web crawling master thesis defense jan fiedler 04/17/98

Mobile Web Crawling

Master Thesis Defense

Jan Fiedler

04/17/98


Presentation Outline

• Resource Discovery Problem
• Web Crawling Techniques
– Traditional Web Crawling
– Mobile Web Crawling
• Mobile Crawling Architecture
– Distributed Runtime Environment
– Application Framework
– Performance Evaluation
• Summary and Conclusion


Resource Discovery Problem

• The Web establishes a large distributed hypertext system
– 1.6 million Web sites
– 320 million Web documents
– 40% of the Web content changes within a month
– exponential growth rate
– lack of structure (i.e., no strict hierarchy)

Goal: overlay the distributed Web structure with a centralized information system that allows resource discovery.


Web Indices and Search Engines

• Search engine statistics:
– index size 30-110 million pages (approx. 700 GB)
– web coverage 10%-35%
– daily crawl 3-10 million pages (approx. 60 GB)
• Year 2000 estimates:
– index size 880 million pages (approx. 5.6 TB)
– daily crawl 80 million pages (approx. 480 GB)

Traditional Web crawling will experience severe scaling problems in the near future.
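The per-page sizes implied by these figures can be checked with a little arithmetic. The sketch below assumes decimal units (1 GB = 1,000,000 KB) and assumes that the 700 GB and 60 GB figures refer to the upper ends of the quoted ranges (110 million and 10 million pages); the function and variable names are illustrative only.

```python
# Implied average page size from the figures above (decimal units assumed).
KB_PER_GB = 1_000_000

def kb_per_page(total_gb, pages_millions):
    """Average size per page in KB for a given data volume and page count."""
    return total_gb * KB_PER_GB / (pages_millions * 1_000_000)

# Assumption: the quoted volumes belong to the upper end of each range.
current_index = kb_per_page(700, 110)   # current index size
index_2000 = kb_per_page(5600, 880)     # year 2000 index estimate (5.6 TB)
current_crawl = kb_per_page(60, 10)     # current daily crawl
crawl_2000 = kb_per_page(480, 80)       # year 2000 daily crawl estimate
```

Under these assumptions both index figures imply roughly 6.4 KB per page and both crawl figures imply 6 KB per page, so the year 2000 estimates scale the number of pages rather than the page size.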


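Traditional Crawling Overview

[Figure: architecture of a traditional crawler (Google domain). Components shown: URL Server, four Crawlers fetching pages from the Web over HTTP, Store Server, Repository, Indexer, Anchors, URL Resolver, connected via a LAN.]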


Traditional Web Crawling

• Characteristics of traditional Web crawling:
– remote data access
– focus on rapid data retrieval
– centralized, database-oriented architecture
– brute-force download of Web content
– resource-intensive approach

Traditional Web crawling techniques do not exploit information about the pages being crawled in order to reduce the crawling costs.


Mobile Crawling Overview

[Figure: mobile crawling overview. A Search Engine (Index, Crawler Manager) is connected through the Web to three Remote Hosts, each running an HTTP Server.]


Mobile Web Crawling

• Characteristics of mobile Web crawling:
– local data access
– focus on effective data retrieval
– distributed, data-source-oriented architecture
– intelligent download of significant Web content
– resource-preserving approach

Mobility allows a Web crawler to analyze Web pages before investing Web resources in their transmission.


Mobile Crawling Advantages

• Remote page selection
– determine the significance of a page prior to transmission
– applicable for specialized search engines
• Remote page filtering
– use an effective page representation model
– applicable for non-fulltext search engines
• Remote page compression
– compress page data prior to transmission
– applicable for all search engines
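The three remote operations can be illustrated with a minimal sketch. It assumes keyword matching as the selection criterion, a "keep the first N words" stand-in for the page representation model, and zlib as the compression scheme; none of these choices, nor the function names and example URLs, come from the slides.

```python
import zlib

def select_page(text, keywords):
    """Remote page selection: keep a page only if it mentions a keyword."""
    words = set(text.lower().split())
    return bool(words & keywords)

def filter_page(text, keep_words):
    """Remote page filtering: reduce the page to a smaller representation.

    Keeping the first `keep_words` words is a placeholder for whatever
    representation model a non-fulltext search engine would really use.
    """
    return " ".join(text.split()[:keep_words])

def compress_page(text):
    """Remote page compression: shrink the payload before transmission."""
    return zlib.compress(text.encode("utf-8"))

def prepare_for_transmission(pages, keywords, keep_words):
    """Apply selection, filtering, and compression on the remote host."""
    return {
        url: compress_page(filter_page(text, keep_words))
        for url, text in pages.items()
        if select_page(text, keywords)
    }

# Hypothetical pages held by the remote HTTP server.
pages = {
    "http://example.org/a": "Mobile agents migrate to the data source " * 20,
    "http://example.org/b": "Recipes for apple pie and other desserts " * 20,
}
payload = prepare_for_transmission(pages, {"mobile", "crawler"}, keep_words=50)
```

Only the page that matches a keyword is transmitted, and it travels in filtered, compressed form; the other page never leaves the remote host.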


Crawler Specification

• Rule-based programming paradigm
– represent crawler data as facts (e.g. page-facts)
– describe crawler behavior as a set of rules which operate upon facts
• Advantages
– it is easier to specify crawling rules than to devise a crawling algorithm
– no need to model control flow
– rule-based programs have very simple runtime states
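A minimal sketch of the paradigm, assuming facts are plain tuples and rules are functions that derive new facts from the current fact set. The fact layouts ("page", "link", "relevant") and the rule names are hypothetical and not taken from the thesis.

```python
# Facts are tuples such as ("page", url, text); rules derive new facts.

def rule_extract_links(facts):
    """For every page fact, derive a link fact for each 'href=' token."""
    for fact in facts:
        if fact[0] == "page":
            _, url, text = fact
            for token in text.split():
                if token.startswith("href="):
                    yield ("link", url, token[len("href="):])

def rule_mark_relevant(facts):
    """For every page fact mentioning 'crawler', derive a relevant fact."""
    for fact in facts:
        if fact[0] == "page" and "crawler" in fact[2].lower():
            yield ("relevant", fact[1])

def apply_once(rules, facts):
    """Apply each rule once and return only the newly derived facts."""
    derived = set()
    for rule in rules:
        derived.update(rule(facts))
    return derived - facts

facts = {
    ("page", "http://example.org/",
     "A crawler primer href=http://example.org/next"),
}
new_facts = apply_once([rule_extract_links, rule_mark_relevant], facts)
```

The rules do not call each other and their order does not matter, which illustrates why no control flow needs to be modeled and why the runtime state is just the set of facts.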


Mobile Crawling Architecture

[Figure: two-layer architecture. Application Framework Architecture: Archive Manager (Database Command Manager, Connection Manager, SQL access to the DB), Query Engine, Crawler Manager with Crawler Spec, Communication Subsystem (Outbox, Inbox). Distributed Crawler Runtime Environment: four Virtual Machines reachable over the Net, each next to an HTTP Server.]


Mobile Crawling Architecture

• Distributed Crawler Runtime Environment
– provide a platform-independent execution environment
– virtual machine for remote crawler execution
– communication layer for crawler migration
• Application Framework
– support for crawler specification and configuration
– crawler manager for crawler specification
– query engine as crawler/application interface
– archive manager as database connectivity framework


Crawler Virtual Machine

• How to execute a rule-based crawler specification?
– crawler execution = rule application upon the fact base
– use an inference engine for the rule application process

1. Initialization
• insert rules and facts into the inference engine
2. Rule application
• start the rule application process within the inference engine
3. Finalization
• extract rules and facts once rule application has stopped
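The three phases can be sketched with a toy forward-chaining engine. `TinyInferenceEngine`, the `reachable` rule, and the tuple-based fact layout are hypothetical stand-ins; the slides do not say which inference engine the virtual machine actually embeds.

```python
class TinyInferenceEngine:
    """Hypothetical engine: each rule maps the fact set to derived facts."""

    def __init__(self):
        self.rules = []
        self.facts = set()

    def run(self, max_rounds=100):
        """Apply all rules repeatedly until no new fact is derived."""
        for _ in range(max_rounds):
            derived = set()
            for rule in self.rules:
                derived.update(rule(self.facts))
            if derived <= self.facts:
                return  # fixpoint reached: rule application has stopped
            self.facts |= derived

def reachable(facts):
    """Rule: reachable(a) and link(a, b) imply reachable(b)."""
    known = {f[1] for f in facts if f[0] == "reachable"}
    return {("reachable", f[2]) for f in facts
            if f[0] == "link" and f[1] in known}

def execute_crawler(rules, facts):
    engine = TinyInferenceEngine()
    # 1. Initialization: insert rules and facts into the engine.
    engine.rules = list(rules)
    engine.facts = set(facts)
    # 2. Rule application: run until no rule produces anything new.
    engine.run()
    # 3. Finalization: extract rules and facts as the new crawler state.
    return engine.rules, engine.facts

start = {("reachable", "a"), ("link", "a", "b"),
         ("link", "b", "c"), ("link", "x", "y")}
rules, facts = execute_crawler([reachable], start)
```

Because the crawler's entire state is the pair of rules and facts, extracting that pair in the finalization step is all that is needed to hand the crawler on to another host.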


Crawler Virtual Machine

[Figure: internal structure of the Virtual Machine. A Communication Layer and a Scheduling component sit above four Execution Threads, each running its own Inference Engine.]


Crawler Query Engine

• How to access the crawler knowledge?
– provide a query facility to query the crawler fact base
– implement a SQL subset as the query language
– represent query results as data tuples, not as facts
– allow the user to reason about crawling results
– the query engine implementation uses the inference engine

The query engine serves as the primary interface between the user application and the mobile crawler.
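A rough illustration of querying a fact base with a SQL subset and getting plain tuples back. This sketch supports only `SELECT ... FROM ... [WHERE col = 'value']` and filters the facts directly; it does not compile the query into a rule for an inference engine as the slides describe, and the dictionary-based fact layout is an assumption.

```python
import re

# Grammar of the illustrative subset: SELECT cols FROM kind [WHERE col = 'val']
QUERY = re.compile(
    r"^\s*SELECT\s+(?P<cols>[\w\s,]+?)\s+FROM\s+(?P<kind>\w+)"
    r"(?:\s+WHERE\s+(?P<col>\w+)\s*=\s*'(?P<val>[^']*)')?\s*$",
    re.IGNORECASE,
)

def run_query(sql, facts):
    """Evaluate a query against a list of fact dicts; return data tuples."""
    match = QUERY.match(sql)
    if match is None:
        raise ValueError(f"unsupported query: {sql!r}")
    cols = [c.strip() for c in match.group("cols").split(",")]
    rows = []
    for fact in facts:
        if fact.get("type") != match.group("kind"):
            continue  # FROM clause: only facts of the requested kind
        col = match.group("col")
        if col and str(fact.get(col)) != match.group("val"):
            continue  # WHERE clause: equality test on one attribute
        rows.append(tuple(fact.get(c) for c in cols))
    return rows

# Hypothetical crawler facts.
crawler_facts = [
    {"type": "page", "url": "u1", "title": "T1", "lang": "en"},
    {"type": "page", "url": "u2", "title": "T2", "lang": "de"},
    {"type": "link", "src": "u1", "dst": "u2"},
]
english = run_query("SELECT url, title FROM page WHERE lang = 'en'",
                    crawler_facts)
links = run_query("select src, dst from link", crawler_facts)
```

Returning tuples rather than facts keeps the application independent of how the crawler represents its knowledge internally.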


Crawler Query Engine

[Figure: query engine data flow. Elements shown: Crawler Object, Crawler Facts, Crawler Rules, User Query, Query Compiler, Query Rule, Inference Engine, Result Tuples.]


Performance Evaluation Setup

• Use distributed virtual machines to support mobile as well as traditional Web crawling

[Figure: evaluation setup with a remote and a local side. Components shown: Crawler Manager, Crawler Spec, Communication Subsystem, two Virtual Machines, HTTP Server, HTML data.]



Performance Evaluation

• Controlled environment setup
– static HTML data set with known properties
– personal HTTP server
– unshared communication channel (dialup line)
• Measurements
1. network load for traditional (stationary) crawler
2. network load for mobile crawler without page compression
3. network load for mobile crawler with page compression


Benefit of Remote Page Selection

[Figure: bar chart of total load (KB, axis 0-450) for crawlers S1, M1, M2, M3, M4; series: uncompressed, compressed.]

Traditional crawler (S1) versus mobile crawlers (M1-M4) with different keyword sets for page selection


Benefit of Remote Page Filtering

Mobile crawler (M1) with a decreasing degree of page filtering (10%-90% page data preserved)

[Figure: chart of network load (axis 0%-120%) against filter degree (90% down to 10%); series: load uncompressed, load compressed.]


Benefit of Page Compression

Traditional crawler (S1) and mobile crawler (M1) with an increasing number of crawled pages

[Figure: chart of total load (KB, axis 0-900) against retrieved pages (1, 10, 22, 51, 82, 158); series: stationary, mobile uncompressed, mobile compressed.]


Costs and Benefits

• Overhead
– overhead due to crawler migration (< 5 KB)
– overhead due to fact-based data representation (6%)
• Benefits without page compression
– as soon as less than 85% of each page needs to be preserved
– as soon as less than 90% of all pages are transmitted
• Benefits with page compression
– reduction in network load by a factor of 4.5
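The trade-off between overhead and savings can be explored with a simple linear load model. The model itself is an assumption made for illustration: it charges the migration overhead once, applies the 6% representation overhead to every transmitted byte, and uses hypothetical data set sizes. It is not claimed to reproduce the measured 85% and 90% thresholds, which depend on the data set used in the evaluation.

```python
MIGRATION_OVERHEAD_KB = 5.0  # upper bound quoted for crawler migration
FACT_OVERHEAD = 0.06         # quoted overhead of the fact representation

def stationary_load(total_kb):
    """A stationary crawler downloads every page in full."""
    return total_kb

def mobile_load(total_kb, pages_fraction=1.0, preserved_fraction=1.0,
                compression_factor=1.0):
    """Assumed model: migration cost plus the reduced, fact-encoded payload."""
    payload = total_kb * pages_fraction * preserved_fraction
    return MIGRATION_OVERHEAD_KB + payload * (1 + FACT_OVERHEAD) / compression_factor

def break_even_fraction(total_kb):
    """Fraction of the data below which the mobile crawler sends less."""
    return (total_kb - MIGRATION_OVERHEAD_KB) / (total_kb * (1 + FACT_OVERHEAD))

# Hypothetical data set of 400 KB.
no_reduction = mobile_load(400)                      # overhead only, no savings
half_selected = mobile_load(400, pages_fraction=0.5)
threshold = break_even_fraction(400)
```

In this model the break-even fraction grows with the size of the data set, because the fixed migration cost is spread over more data.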


Summary and Conclusion

• Mobile crawling advantages:
– the approach fits the distributed Web environment better
– the approach is beneficial for all types of search engines
– better support for specialized search engines
– network overhead due to crawler mobility is small

Mobile crawling addresses the scaling problems of the traditional crawling approach by allowing remote operations to be performed on the crawled data.

The approach provides a basis for smart Web crawling.


Future Work

• Security
– crawler identification based on digital signatures
– restrict crawler execution to positively identified crawlers
– implement the virtual machine as a secure sandbox
• Crawler mobility support
– integrate the virtual machine into Web servers
• Mobile crawling algorithms
– optimize crawling algorithms with crawler mobility in mind (e.g. crawler communication)
