Mobile Web Crawling
Master Thesis Defense
Jan Fiedler
04/17/98
04/17/98 [email protected] 2
Presentation Outline
• Resource Discovery Problem
• Web Crawling Techniques
– Traditional Web Crawling
– Mobile Web Crawling
• Mobile Crawling Architecture
– Distributed Runtime Environment
– Application Framework
– Performance Evaluation
• Summary and Conclusion
Resource Discovery Problem
• The Web forms a large distributed hypertext system
– 1.6 million Web sites
– 320 million Web documents
– 40% of the Web content changes within a month
– exponential growth rate
– lack of structure (i.e. no strict hierarchy)
Goal: overlay the distributed Web structure with a centralized information system which allows resource discovery
Web Indices and Search Engines
• Search engine statistics:
– index size 30-110 million pages (approx. 700GB)
– web coverage 10%-35%
– daily crawl 3-10 million pages (approx. 60GB)
• Year 2000 estimates:
– index size 880 million pages (approx. 5.6TB)
– daily crawl 80 million pages (approx. 480GB)
Traditional Web crawling will experience severe scaling problems in the near future.
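As a rough back-of-the-envelope check of the year 2000 estimates, the sketch below derives the average page size and the sustained bandwidth implied by the daily crawl. It assumes decimal units (1 GB = 10^9 bytes) and a crawl spread evenly over 24 hours; neither assumption is stated on the slide, and the function names are illustrative.

```python
# Back-of-the-envelope check of the year 2000 estimates.
# Assumptions (not from the slides): decimal units, crawl spread evenly over 24 h.
GB = 10**9
TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def average_page_size(total_bytes: float, pages: float) -> float:
    """Average number of bytes per page."""
    return total_bytes / pages


def sustained_bandwidth_mbit(bytes_per_day: float) -> float:
    """Bandwidth in Mbit/s needed to move bytes_per_day within one day."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 10**6


index_page_size = average_page_size(5.6 * TB, 880e6)  # bytes per indexed page
crawl_page_size = average_page_size(480 * GB, 80e6)   # bytes per crawled page
crawl_bandwidth = sustained_bandwidth_mbit(480 * GB)  # Mbit/s

print(f"index: {index_page_size:.0f} bytes per page")
print(f"crawl: {crawl_page_size:.0f} bytes per page")
print(f"daily crawl: about {crawl_bandwidth:.1f} Mbit/s sustained")
```

Under these assumptions both estimates work out to roughly 6 KB per page, and the daily crawl corresponds to about 44 Mbit/s of sustained traffic.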
Traditional Crawling Overview
[Figure labels: Google domain (LAN) containing URL Server, Crawler (x4), Store Server, Repository, Indexer, Anchors, URL Resolver; Web, reached via HTTP]
Traditional Web Crawling
• Characteristics of traditional Web crawling:
– remote data access
– focus on rapid data retrieval
– centralized, database oriented architecture
– brute force download of Web content
– resource intensive approach
Traditional Web crawling techniques do not exploit information about the pages being crawled in order to reduce the crawling costs.
Mobile Crawling Overview
[Figure labels: Search Engine with Index and Crawler Manager; Web; three Remote Hosts, each with an HTTP Server]
Mobile Web Crawling
• Characteristics of mobile Web crawling:
– local data access
– focus on effective data retrieval
– distributed, data source oriented architecture
– intelligent download of significant Web content
– resource preserving approach
Mobility allows a Web crawler to analyze Web pages before investing Web resources in their transmission.
Mobile Crawling Advantages
• Remote page selection
– determine significance of a page prior to transmission
– applicable for specialized search engines
• Remote page filtering
– use effective page representation model
– applicable for non-fulltext search engines
• Remote page compression
– compress page data prior to transmission
– applicable for all search engines
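The three remote operations can be sketched as a small pipeline that runs on the remote host before anything is sent back. This is only an illustration: keyword matching as the significance test, tag stripping as the page representation, and zlib as the compressor are assumptions, and all function names are hypothetical rather than taken from the thesis.

```python
import re
import zlib


def filter_page(html: str) -> str:
    """Remote page filtering: strip markup and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.split())


def select_page(text: str, keywords: set[str]) -> bool:
    """Remote page selection: keep a page only if it mentions a keyword.

    Keywords are expected in lowercase (an assumption of this sketch).
    """
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return bool(words & keywords)


def compress_page(text: str) -> bytes:
    """Remote page compression: shrink the data prior to transmission."""
    return zlib.compress(text.encode("utf-8"))


def prepare_for_transmission(pages: dict[str, str],
                             keywords: set[str]) -> dict[str, bytes]:
    """Filter, select and compress pages; return only what must be sent."""
    payload = {}
    for url, html in pages.items():
        text = filter_page(html)
        if select_page(text, keywords):
            payload[url] = compress_page(text)
    return payload


pages = {
    "http://a.example/": "<html><body><h1>Mobile agents</h1>"
                         "<p>Crawler notes.</p></body></html>",
    "http://b.example/": "<html><body><p>Cooking recipes</p></body></html>",
}
payload = prepare_for_transmission(pages, {"crawler"})
print(sorted(payload))
```

Only the first page mentions the keyword, so only its filtered and compressed text would cross the network.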
Crawler Specification
• Rule-based programming paradigm
– represent crawler data as facts (e.g. page-facts)
– describe crawler behavior as a set of rules which operate upon facts
• Advantages
– it is easier to specify crawling rules than to devise a crawling algorithm
– no need to model control flow
– rule-based programs have very simple runtime states
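To make the paradigm concrete, here is a minimal sketch of page-facts and one rule operating on them. The fact fields, the rule, and the helper `apply_once` are hypothetical; the slides do not show the actual specification language.

```python
from typing import Callable, NamedTuple


class PageFact(NamedTuple):
    """A page-fact: what the crawler knows about one page (fields assumed)."""
    url: str
    title: str
    size: int


class KeepFact(NamedTuple):
    """A derived fact: this page is worth transmitting."""
    url: str


class Rule(NamedTuple):
    name: str
    condition: Callable[[PageFact], bool]    # when does the rule fire?
    conclusion: Callable[[PageFact], tuple]  # which fact does it add?


RULES = [
    Rule(
        name="keep-small-crawling-pages",
        condition=lambda p: "crawl" in p.title.lower() and p.size < 50_000,
        conclusion=lambda p: KeepFact(p.url),
    ),
]


def apply_once(rules, facts):
    """Fire every rule on every page-fact once; return the derived facts."""
    derived = []
    for fact in facts:
        if not isinstance(fact, PageFact):
            continue
        for rule in rules:
            if rule.condition(fact):
                derived.append(rule.conclusion(fact))
    return derived


facts = [
    PageFact("http://example.org/a", "Mobile Web Crawling", 12_000),
    PageFact("http://example.org/b", "Holiday photos", 8_000),
]
derived = apply_once(RULES, facts)
print(derived)
```

The rule states what to keep, not how to iterate; the loop lives in the generic helper, which reflects the "no need to model control flow" point. The whole crawler state is just the two lists of rules and facts.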
Mobile Crawling Architecture
[Figure: two layers. Application Framework Architecture, with labels: Crawler Manager, Crawler Spec, Query Engine, Archive Manager, Database Command Manager, DB Connection Manager, SQL, DB, Communication Subsystem (Outbox, Inbox). Distributed Crawler Runtime Environment, reached over the Net: four nodes, each with a Communication Subsystem, a Virtual Machine, and an HTTP Server]
Mobile Crawling Architecture
• Distributed Crawler Runtime Environment
– provide platform independent execution environment
– virtual machine for remote crawler execution
– communication layer for crawler migration
• Application Framework
– support for crawler specification and configuration
– crawler manager for crawler specification
– query engine as crawler/application interface
– archive manager as database connectivity framework
Crawler Virtual Machine
• How to execute a rule-based crawler specification?
– crawler execution = rule application upon fact base
– use inference engine for the rule application process
1. Initialization
• insert rules and facts into inference engine
2. Rule application
• start rule application process within inference engine
3. Finalization
• extract rules and facts once the rule application has stopped
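The three phases can be sketched with a toy inference engine based on naive forward chaining. This is an assumption-laden illustration: the slides do not say which inference engine or matching strategy the virtual machine uses, and the class, the rule, and the fact layout below are hypothetical.

```python
from typing import Callable, Iterable, Set, Tuple

Fact = Tuple[str, ...]                        # e.g. ("link", source, target)
Rule = Callable[[Set[Fact]], Iterable[Fact]]  # derives facts from the fact base


class InferenceEngine:
    """Toy engine: applies all rules until no new fact appears."""

    def __init__(self) -> None:
        self.rules: list = []
        self.facts: Set[Fact] = set()

    def insert(self, rules, facts) -> None:
        self.rules.extend(rules)
        self.facts.update(facts)

    def run(self) -> None:
        while True:
            derived = set()
            for rule in self.rules:
                derived.update(rule(self.facts))
            derived -= self.facts
            if not derived:      # fixpoint reached: rule application stops
                return
            self.facts |= derived

    def extract(self):
        return list(self.rules), set(self.facts)


def execute_crawler(rules, facts):
    engine = InferenceEngine()
    engine.insert(rules, facts)  # 1. Initialization
    engine.run()                 # 2. Rule application
    return engine.extract()      # 3. Finalization


def follow_links(facts):
    """Hypothetical rule: a page linked from a visited page becomes visited."""
    visited = {f[1] for f in facts if f[0] == "visited"}
    for f in facts:
        if f[0] == "link" and f[1] in visited:
            yield ("visited", f[2])


start = {("visited", "/"), ("link", "/", "/a"),
         ("link", "/a", "/b"), ("link", "/x", "/y")}
rules_out, facts_out = execute_crawler([follow_links], start)
print(sorted(f for f in facts_out if f[0] == "visited"))
```

A real crawler would add link facts while fetching pages; here they are inserted up front to keep the sketch self-contained. Because the extracted state is again only rules and facts, it could be handed to another engine on another host.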
Crawler Virtual Machine
[Figure labels: Virtual Machine with Communication Layer, Scheduling, and four Execution Threads, each with an Inference Engine]
Crawler Query Engine
• How to access the crawler knowledge?
– provide a query facility to query the crawler fact base
– implement a SQL subset as query language
– represent query result as data tuples, not as facts
– allows the user to reason about crawling results
– query engine implementation uses inference engine
Query engine serves as the primary interface between the user application and the mobile crawler.
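A very small sketch of the idea: a query in a SQL-like subset is compiled into a function that plays the role of a query rule, and applying it to the fact base yields plain tuples. The supported grammar (one fact type, one optional comparison), the dictionary fact layout, and all names are assumptions for illustration; the thesis's actual query language and compiler are not shown on the slides.

```python
import operator
import re

# Assumed grammar: SELECT col[, col...] FROM type [WHERE field (=|<|>) value]
QUERY = re.compile(
    r"^\s*SELECT\s+(?P<cols>[\w\s,]+?)\s+FROM\s+(?P<table>\w+)"
    r"(?:\s+WHERE\s+(?P<field>\w+)\s*(?P<op>=|<|>)\s*(?P<value>\S+))?\s*$",
    re.IGNORECASE,
)
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def compile_query(sql):
    """Compile a query string into a 'query rule' (fact -> tuple or None)."""
    match = QUERY.match(sql)
    if match is None:
        raise ValueError(f"unsupported query: {sql!r}")
    columns = [c.strip() for c in match["cols"].split(",")]
    fact_type = match["table"]
    field, op, raw = match["field"], match["op"], match["value"]

    def query_rule(fact):
        if fact.get("type") != fact_type:
            return None
        if field is not None:
            if field not in fact:
                return None
            value = fact[field]
            # Compare numerically when the stored value is an integer.
            literal = int(raw) if isinstance(value, int) else raw.strip("'")
            if not OPS[op](value, literal):
                return None
        return tuple(fact.get(c) for c in columns)

    return query_rule


def run_query(sql, facts):
    """Apply the compiled query rule to every fact; return result tuples."""
    rule = compile_query(sql)
    results = (rule(fact) for fact in facts)
    return [r for r in results if r is not None]


facts = [
    {"type": "page", "url": "http://example.org/a", "size": 12000},
    {"type": "page", "url": "http://example.org/b", "size": 8000},
    {"type": "link", "source": "http://example.org/a",
     "target": "http://example.org/b"},
]
large = run_query("SELECT url, size FROM page WHERE size > 10000", facts)
print(large)
```

The result is a list of tuples rather than new facts, so the user application never has to deal with the crawler's internal fact representation.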
Crawler Query Engine
[Figure labels: Crawler Object with Crawler Facts and Crawler Rules; Query Engine with User Query, Query Compiler, Query Rule, Inference Engine, and Result Tuples]
Performance Evaluation Setup
• Use distributed virtual machines to support mobile as well as traditional Web crawling
[Figure labels: REMOTE and LOCAL sides; Crawler Manager, Crawler Spec, Communication Subsystem (x3), Virtual Machine (x2), HTTP Server, HTML]
Performance Evaluation
• Controlled environment setup
– static HTML data set with known properties
– personal HTTP server
– unshared communication channel (dialup line)
• Measurements
1. network load for traditional (stationary) crawler
2. network load for mobile crawler without page compression
3. network load for mobile crawler with page compression
Benefit of Remote Page Selection
[Bar chart: total load (KB), axis 0 to 450, for S1, M1, M2, M3, M4; series: uncompressed, compressed]
Traditional crawler (S1) versus mobile crawlers (M1-M4) with different keyword sets for page selection
Benefit of Remote Page Filtering
Mobile crawler (M1) with a decreasing degree of page filtering (10%-90% page data preserved)
[Chart: network load (axis 0% to 120%) versus filter degree (90% down to 10%); series: load uncompressed, load compressed]
Benefit of Page Compression
Traditional crawler (S1) and mobile crawler (M1) with an increasing number of crawled pages
[Chart: total load (in KB, axis 0 to 900) versus retrieved pages (1, 10, 22, 51, 82, 158); series: stationary, mobile uncompressed, mobile compressed]
Costs and Benefits
• Overhead
– overhead due to crawler migration (<5K)
– overhead due to fact-based data representation (6%)
• Benefits without page compression
– as soon as less than 85% of each page needs to be preserved
– as soon as less than 90% of all pages are transmitted
• Benefits with page compression
– reduction in network load by a factor of 4.5
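The trade-off between overhead and savings can be explored with a simple linear cost model. This is a sketch under stated assumptions, not the thesis's evaluation method: it reads "<5K" as at most 5 KB of migration overhead, applies the 6% fact overhead to all transmitted data, and treats compression as a constant ratio. It is not meant to reproduce the measured break-even points (85% and 90%), which come from the experiments.

```python
def stationary_load(pages: int, avg_page_kb: float) -> float:
    """Network load in KB when every page is downloaded in full."""
    return pages * avg_page_kb


def mobile_load(pages: int, avg_page_kb: float,
                pages_sent: float = 1.0,        # fraction of pages transmitted
                data_kept: float = 1.0,         # fraction of each page preserved
                migration_kb: float = 5.0,      # assumed reading of "<5K"
                fact_overhead: float = 0.06,    # 6% fact representation overhead
                compression_ratio: float = 1.0  # 1.0 means no compression
                ) -> float:
    """Network load in KB for a mobile crawler under a linear cost model."""
    payload = pages * pages_sent * avg_page_kb * data_kept
    return migration_kb + payload * (1 + fact_overhead) / compression_ratio


baseline = stationary_load(100, 6.0)
no_savings = mobile_load(100, 6.0)                   # overhead only
filtered = mobile_load(100, 6.0, data_kept=0.85)     # 85% of each page kept
selected = mobile_load(100, 6.0, pages_sent=0.5)     # half of the pages sent

print(f"stationary:        {baseline:.1f} KB")
print(f"mobile, no saving: {no_savings:.1f} KB")
print(f"mobile, filtered:  {filtered:.1f} KB")
print(f"mobile, selected:  {selected:.1f} KB")
```

In this model a mobile crawler that sends everything unchanged costs slightly more than a stationary one, and it starts to pay off once selection or filtering removes more data than the overhead adds.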
Summary and Conclusion
• Mobile crawling advantages:
– approach fits better in distributed web environment
– approach beneficial for all types of search engines
– better support for specialized search engines
– network overhead due to crawler mobility is small
Mobile crawling solves the scaling problems of the traditional crawling approach by allowing remote operations to be performed on the crawled data.
Approach provides a base for smart Web crawling.
Future Work
• Security
– crawler identification based on digital signatures
– restrict crawler execution to positively identified crawlers
– implement virtual machine as a secure sandbox
• Crawler mobility support
– integrate virtual machine into web servers
• Mobile crawling algorithms
– optimize crawling algorithms with crawler mobility in mind (e.g. crawler communication)