A Topic Specific Web Crawler and WIE*: An Automatic Web Information Extraction Technique using HPS Algorithm
2002.09.04
Dongwon Lee
Database Systems Lab.
2/20
Contents
• Introduction
• Crawler
• Problem of Web Crawler
• URL Ordering
• Focused Crawler
• Strategy for URL Ordering
• Ordering Model
• WIE*
• Conclusion
3/20
Introduction(1)
• Search engine
• Gathering web resources
• Shortcomings of search engines
  – Large answer set
  – Low precision
  – Destroying the hypertext structure of matching hyperdocuments
  – General concept queries
• Topic-driven web crawling
4/20
Introduction(2)
• Topic-driven crawling reduces the search space, which improves the efficiency of a web search engine.
• It can be applied to special-purpose search engines.
  – Ex) Medical information retrieval, travel information retrieval, biology information retrieval
5/20
Web Crawler(1)
• A program that automatically traverses the Web via hyperlinks embedded in hypertext, newsgroup listings, directory structures, or database schemas.
• Gathers resources from the Web
• To ensure an index is kept as up to date as possible
• To achieve the broadest possible coverage of the Web
6/20
Web Crawler(2)
• Retrieving Module
• Processing Module
• Formatting Module
• URL Listing Module
• The order of traversing
  – Breadth-first
  – Depth-first
  – Better pages first
• How frequently the index is updated
  – Stars in the sky view
[Figure: crawler architecture connecting the World Wide Web, Retrieving Module, Processing Module, Formatting Module, URL Listing Module, and Database]
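The three traversal orders differ only in which URL the frontier hands out next. A minimal sketch, assuming a toy in-memory link graph (`LINKS`) and made-up page scores (`SCORE`) in place of real fetching; these names and values are illustrative and do not come from the slides.

```python
import heapq
from collections import deque

# Hypothetical toy link graph standing in for the Web: page -> outgoing links.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}
# Hypothetical page scores for "better pages first" (higher is better).
SCORE = {"A": 0.5, "B": 0.2, "C": 0.9, "D": 0.1, "E": 0.8}


def crawl(seed, order):
    """Return the visit order for 'bfs', 'dfs', or 'best' (better pages first)."""
    if order not in ("bfs", "dfs", "best"):
        raise ValueError(order)
    visited, seen = [], {seed}
    frontier = [(-SCORE[seed], seed)] if order == "best" else deque([seed])
    while frontier:
        if order == "bfs":
            page = frontier.popleft()          # oldest URL first
        elif order == "dfs":
            page = frontier.pop()              # newest URL first
        else:
            page = heapq.heappop(frontier)[1]  # highest score first
        visited.append(page)
        links = LINKS[page]
        if order == "dfs":
            links = reversed(links)  # so the first link on a page is popped first
        for link in links:
            if link not in seen:
                seen.add(link)
                if order == "best":
                    heapq.heappush(frontier, (-SCORE[link], link))
                else:
                    frontier.append(link)
    return visited
```

Calling `crawl("A", "bfs")`, `crawl("A", "dfs")`, and `crawl("A", "best")` on the same graph shows how the choice of frontier alone changes which pages are reached early.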
7/20
Problem of Web Crawler
• How frequently the index is updated
  – How old is an index to a Web page
    • Varies a lot: one day to two months
    • Stars in the sky view
  – Percentage of invalid links: 2-9%
• Not able to visit every possible page
  – Must periodically revisit pages
  – Storage capacity
  – Network bandwidth
• Shortcomings of information retrieval systems
  – Too large an answer set
  – Low precision
  – Etc.
8/20
URL Ordering
• BFS, DFS
• URL ordering
  – Document similarity measure
  – PageRank
    • Inlinks/Outlinks
    • $PR(A) = (1 - d) + d \sum_{i=1}^{n} PR(T_i) / C(T_i)$
  – URL measure
    • “.com”, “home”, “www”
• J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In Proc. 7th Intl. World Wide Web Conference, Brisbane, Australia, 1998.
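The PageRank measure can be computed by simple fixed-point iteration. A minimal sketch, assuming the usual form PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T that link to A, where C(T) is the number of outlinks of T; the damping value 0.85, the iteration count, and the toy graph are assumptions, not values from the slides.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to; C(T) is len(links[T]).
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum the rank shares handed over by every page that links here.
            inbound = sum(
                pr[src] / len(links[src])
                for src in pages
                if page in links[src]
            )
            new_pr[page] = (1 - d) + d * inbound
        pr = new_pr
    return pr


# Hypothetical three-page graph in which every page has at least one outlink.
toy = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy)
```

In this toy graph page C receives links from both A and B, so it ends up ranked above B, which is linked only from A.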
9/20
Focused Crawler
• Text categorization method
  – Naïve Bayes algorithm
• Distiller
  – Authority
  – Hub
• Soumen Chakrabarti, Martin van den Berg, Byron Dom. Focused crawling: a new approach to topic-specific Web resource discovery. 8th World Wide Web Conference.
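Hub and authority scores of the kind the distiller works with can be illustrated with the standard HITS iteration. This is a generic sketch of HITS on a hypothetical toy graph, not the distiller from the cited paper; the iteration count and page names are assumptions.

```python
import math


def hits(links, iterations=20):
    """Return (hub, authority) scores for a graph given as page -> outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # A good hub points to good authorities.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: two hub-like pages pointing at two authority-like pages.
toy = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(toy)
```

Here `a1` is linked from both hubs and so gets the highest authority score, while `h1` links to both authorities and gets the highest hub score.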
10/20
Strategy for URL Ordering
• A page is better if it has many outgoing links
• A page is better if its content is more relevant to a given topic
• The importance of the pages passed through influences the importance of the next page
• A Web Crawler using Hyperlink Structure and Hypertext Categorization Method, The 17th KIPS Spring Conference
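The three heuristics above can be folded into a single priority value. The sketch below is one possible combination for illustration only and is not the formula from the cited paper; the function name `page_score`, its parameters, and the averaging rule are all assumptions.

```python
def page_score(relevance, n_outlinks, max_outlinks, parent_scores=()):
    """Combine the three heuristics into one priority value (illustrative only).

    relevance     -- topic relevance of the page content, in [0, 1]
    n_outlinks    -- number of outgoing links on the page
    max_outlinks  -- largest outlink count seen so far, used to normalize
    parent_scores -- scores of the pages through which this page was reached
    """
    link_factor = n_outlinks / max_outlinks if max_outlinks else 0.0
    own = relevance * link_factor
    if not parent_scores:
        return own
    inherited = sum(parent_scores) / len(parent_scores)
    # Assumption: own evidence and inherited importance are simply averaged.
    return (own + inherited) / 2


root = page_score(relevance=0.8, n_outlinks=10, max_outlinks=20)
child = page_score(0.5, 20, 20, parent_scores=[root])
```

A crawler could use such a value as the key of its URL queue, so that links found on relevant, link-rich pages are followed first.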
11/20
Ordering Model(1)
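• $d(p_i)$: depth from $p_k$ to $p_i$
• $C(p_k)$: probability that $p_k$ is classified into the class
• $L_k = \{p_i \mid p_i \text{ is a document that is linked from } p_k\}$
• $P_i = \{p_k \mid p_k \text{ is a document that is passed by } p_i\}$
• $IPL(p_k) = C(p_k) \times \dfrac{\#\text{ of outgoing links}}{\text{Max } \#\text{ of outgoing links}}$
• $IPL(p_i) = \sum_{p_k \in P_i} \dfrac{IPL(p_k) \times C(p_i)}{\#\text{ of outgoing links of } p_k}$
[Figure: example link tree with root P1, children P2, P3, P4, and descendants P5, P4, P5, P6, tabulated by k and d(p_i)]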
12/20
Ordering Model(2)
[Figure: crawler architecture connecting the WWW, Retrieving Module, Classifier Module, Link Module, Filter Module, URL Listing Module, Indexer, and Index]
13/20
Image Crawler
• Filtering useless images
  – Bullet, background image, …
• Find semantics for indexing
  – Use the document that contains the images
• Generate metadata automatically
  – MPEG-7
14/20
WIE*
• Automatic web page traversal and content extraction
• HPS (hyperlink prioritization search)
• Mechanism
  – Identify a URL
  – Travel the Web from a provided URL
  – Extract and collect information pieces by paragraph
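The last step of the mechanism, collecting information by paragraph, can be sketched as a simple filter. This is a minimal illustration, assuming the page has already been reduced to plain text, that paragraphs are separated by blank lines, and that matching is a case-insensitive substring test; the function name `extract_paragraphs` is hypothetical.

```python
import re


def extract_paragraphs(text, keywords):
    """Return the paragraphs of text that contain at least one keyword.

    Assumption: paragraphs are separated by blank lines and matching is
    case-insensitive substring matching.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    lowered = [k.lower() for k in keywords]
    return [p for p in paragraphs if any(k in p.lower() for k in lowered)]


page_text = """Focused crawlers visit pages on one topic.

This paragraph is about cooking.

A crawler follows hyperlinks."""
found = extract_paragraphs(page_text, ["crawler"])
```

With the sample text, the first and third paragraphs are kept and the unrelated middle paragraph is dropped.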
15/20
WIE*- HPS(1)
• Notation
  – D = {d | d is a web page}
  – P = {p | p is a paragraph in d}
  – W = {w | w is a word in p}
  – l(d,p,w) is a hyperlink with a (d,p,w) status
  – L(D,P,W) is a hyperlink category to which l(d,p,w) belongs
16/20
WIE*- HPS(2)
• Notation
  – D1 = {d | d is an element of D and contains search keyword(s)}
  – D0 = {d | d is an element of D and not an element of D1}
  – P1 = {p | p is an element of P and contains search keyword(s)}
  – P0 = {p | p is an element of P and not an element of P1}
  – W1 = {w | w is an element of W and contains search keyword(s)}
  – W0 = {w | w is an element of W and not an element of W1}
17/20
WIE*-HPS(3)
• Link category
• LC7 > LC6 > LC4 > LC0
Link category   Page   Paragraph   Link
LC0             D0     P0          W0
LC1             D0     P0          W1
LC2             D0     P1          W0
LC3             D0     P1          W1
LC4             D1     P0          W0
LC5             D1     P0          W1
LC6             D1     P1          W0
LC7             D1     P1          W1
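The category table reads as a three-bit code: the page flag contributes 4, the paragraph flag 2, and the link-word flag 1. A minimal sketch of that mapping together with a priority frontier, assuming that links are followed in plain numeric order from LC7 down to LC0 (the slide lists only LC7 > LC6 > LC4 > LC0); `link_category`, `LinkQueue`, and the example URLs are hypothetical.

```python
import heapq


def link_category(page_match, paragraph_match, link_match):
    """Map the three match flags (D, P, W) to the index n of category LCn."""
    return 4 * int(page_match) + 2 * int(paragraph_match) + int(link_match)


class LinkQueue:
    """Frontier that returns links from the highest category first."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker: keeps insertion order within a category

    def push(self, url, category):
        # heapq is a min-heap, so the category is negated to pop the largest first.
        heapq.heappush(self._heap, (-category, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = LinkQueue()
queue.push("http://example.com/a", link_category(False, False, False))  # LC0
queue.push("http://example.com/b", link_category(True, True, True))     # LC7
queue.push("http://example.com/c", link_category(True, False, False))   # LC4
first = queue.pop()
```

The link pushed with all three flags set is returned first, regardless of the order in which links were discovered.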
18/20
WIE*-HPS(4)
• Trend
  – Global trend
    • $T_G = \dfrac{\#\text{ of matched webpages in the entire visits}}{\#\text{ of webpages in the entire visits}}$
  – Recent trend
    • $T_R = \dfrac{\#\text{ of matched webpages in the recent visits}}{\#\text{ of webpages in the entire visits}}$
19/20
WIE*-HPS(5)
• Termination scheme
  – Comparing global trend and recent trend
  – No more hyperlinks
  – By the user’s decision
• Information extraction
  – Extract paragraphs containing the keyword
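The trend-based termination test can be sketched with two running ratios. The slides do not say how many visits count as recent or which outcome of the comparison ends the crawl, so the window size and the stopping rule below are assumptions; the recent trend is also normalized by the number of recent visits, an assumption made so that the two ratios are directly comparable. The class name `TrendTracker` is hypothetical.

```python
from collections import deque


class TrendTracker:
    """Track the global and recent fraction of visited pages that matched."""

    def __init__(self, window=5):
        self.total = 0
        self.matched = 0
        self.recent = deque(maxlen=window)  # match flags of the last visits

    def record(self, matched):
        self.total += 1
        self.matched += int(matched)
        self.recent.append(bool(matched))

    def global_trend(self):
        return self.matched / self.total if self.total else 0.0

    def recent_trend(self):
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def should_stop(self):
        # Assumed rule: stop once a full recent window does worse than the crawl overall.
        full = len(self.recent) == self.recent.maxlen
        return full and self.recent_trend() < self.global_trend()


tracker = TrendTracker(window=3)
for matched in [True, True, True, False, False, False]:
    tracker.record(matched)
```

After three matching pages followed by three misses, the global trend is still 0.5 while the recent trend has dropped to 0, so `tracker.should_stop()` reports that the crawl has drifted off topic.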
20/20
Further work & Discussion
• Intelligent WIE*
  – Learned WIE*
    • Fast learning mechanism
    • Reinforcement learning
    • Hypertext categorization
  – Advanced information extraction technique
  – Personalization technique
• Using link information
  – Hub and authority
• Client-side light application(?)
  – Ex) Web browser plug-in
21/20
Information Retrieval System
[Figure: information retrieval system. Labels: Query and Feedback, User Interface, Crawling, Indexing, Query Processing, Document Repository, Search Engine, Knowledge Base, Learning, Inference Engine, Ad Hoc Information, Large Text (Multimedia) Database Tech., Data (Text) Mining Tech.]